Lighting has a strong influence on visual appearance, yet understanding and representing lighting in images remains notoriously difficult. Various lighting representations exist, such as environment maps, irradiance, spherical harmonics, or text, but they are mutually incompatible, which limits cross-modal transfer. We thus propose UniLight, a joint latent lighting representation that unifies multiple modalities within a shared embedding space. Modality-specific encoders for text, images, irradiance, and environment maps are trained contrastively to align their representations, with an auxiliary spherical-harmonics prediction task reinforcing directional understanding. Our multi-modal data pipeline enables large-scale training and evaluation across three tasks: lighting-based retrieval, environment-map generation, and lighting control in diffusion-based image synthesis. Experiments show that our representation captures consistent and transferable lighting features, enabling flexible manipulation across modalities.
Overview of our embedding approach. Image- and text-based lighting modalities are first embedded using DINOv2 and Qwen3, respectively. All modalities are then processed by lightweight fusion modules that are trained contrastively to map into our joint latent space, UniLight. To improve latent-space coherence, a linear-probing head estimates spherical-harmonics (SH) coefficients from the latents, and a dedicated loss aligns these coefficients to ground-truth coefficients extracted from the environment map.
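For concreteness, the snippet below is a minimal PyTorch sketch of this training setup: a symmetric contrastive (InfoNCE) loss aligns paired embeddings from two modalities, and a linear probe regresses SH coefficients from the environment-map latent. The embedding dimension, SH order, loss weighting, and module names are illustrative assumptions, not the exact implementation.

```python
# Minimal sketch of contrastive alignment with an auxiliary SH head.
# Dimensions, SH order, and loss weights are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 512          # assumed joint-embedding size
SH_COEFFS = 9 * 3      # order-2 SH coefficients, RGB channels (assumption)

class FusionModule(nn.Module):
    """Lightweight head mapping a frozen backbone feature to the joint space."""
    def __init__(self, in_dim, out_dim=EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)   # unit-norm joint embedding

sh_probe = nn.Linear(EMB_DIM, SH_COEFFS)          # linear-probing SH head

def contrastive_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings."""
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def training_step(z_img, z_env, sh_gt, lambda_sh=1.0):
    """z_img, z_env: joint embeddings of paired image / environment-map inputs."""
    loss_clip = contrastive_loss(z_img, z_env)
    loss_sh = F.mse_loss(sh_probe(z_env), sh_gt)  # align to ground-truth SH
    return loss_clip + lambda_sh * loss_sh
```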
UniLight learns a unified embedding space that enables cross-modal retrieval between different lighting representations. Given a query in one modality (e.g., an RGB image or text description), the system retrieves the most similar samples from a different modality (e.g., environment maps). Click on the tabs below to explore different retrieval pairs and see the top-3 most similar matches for each query.
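To illustrate how retrieval works in the joint space, the sketch below ranks gallery embeddings by cosine similarity to a query embedding. The function name and top-k value mirror the demo (top-3) but are otherwise assumptions.

```python
# Hypothetical retrieval sketch: rank gallery items (e.g., environment maps)
# by cosine similarity to a query embedding in the joint UniLight space.
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb, gallery_embs, k=3):
    """query_emb: (D,) embedding of e.g. an RGB image or text prompt.
    gallery_embs: (N, D) embeddings of e.g. environment maps."""
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_embs, dim=-1)
    sims = g @ q                      # cosine similarities, shape (N,)
    scores, idx = sims.topk(k)
    return idx.tolist(), scores.tolist()
```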
UniLight embeddings enable light editing and control in X->RGB image synthesis. The input image is first decomposed into intrinsic maps (depth, normals, albedo) and then relit using an X->RGB diffusion model conditioned on UniLight embeddings, which can encode any of the supported lighting modalities.
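One plausible way to inject a UniLight embedding into the diffusion model is as a small set of cross-attention tokens; the sketch below shows this pattern. All dimensions, the token count, and module names are assumptions, and the paper's exact conditioning mechanism may differ.

```python
# Sketch: project a UniLight embedding to cross-attention context tokens
# for a diffusion UNet. Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class LightingConditioner(nn.Module):
    def __init__(self, emb_dim=512, ctx_dim=768, num_tokens=4):
        super().__init__()
        self.proj = nn.Linear(emb_dim, num_tokens * ctx_dim)
        self.num_tokens, self.ctx_dim = num_tokens, ctx_dim

    def forward(self, light_emb):
        # (B, emb_dim) -> (B, num_tokens, ctx_dim) cross-attention context
        tokens = self.proj(light_emb)
        return tokens.view(-1, self.num_tokens, self.ctx_dim)

# Usage: context = LightingConditioner()(unilight_embedding), then pass
# `context` as the cross-attention conditioning of the X->RGB diffusion UNet.
```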
The following is a set of relighting samples where the environment map is rotated interactively. From left to right, each example shows the original image, the environment map at the selected rotation, and the relit image generated by UniLight. Use the slider to rotate the environment map from -180° to +180°.
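For reference, rotating an equirectangular environment map about its vertical axis amounts to a wrap-around horizontal shift of its pixels, which is what the slider performs. The helper below is a small sketch of that operation; the array layout (H x W x 3, longitude along the width, and the shift direction) is an assumption about the demo assets.

```python
# Sketch: azimuthal rotation of an equirectangular environment map via a
# wrap-around column shift. Layout and sign convention are assumptions.
import numpy as np

def rotate_envmap(envmap: np.ndarray, degrees: float) -> np.ndarray:
    """Rotate an equirectangular env map (H, W, 3) about the vertical axis."""
    h, w, _ = envmap.shape
    shift = int(round(degrees / 360.0 * w))
    return np.roll(envmap, shift, axis=1)   # shifting columns = rotating longitude
```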
@article{zhang2025unilight,
title={UniLight: A Unified Representation for Lighting},
author={Zhang, Zitian and Georgiev, Iliyan and Fischer, Michael and Hold-Geoffroy, Yannick and Lalonde, Jean-Fran{\c{c}}ois and Deschaintre, Valentin},
journal={arXiv preprint arXiv:2512.04267},
year={2025}
}