Lighting has a strong influence on visual appearance, yet understanding and representing
lighting in images remains notoriously difficult. Various lighting representations exist,
such as environment maps, irradiance, spherical harmonics, and text, but they are mutually
incompatible, which limits cross-modal transfer. We therefore propose UniLight, a joint latent
lighting representation that unifies multiple modalities within a shared embedding space.
Modality-specific encoders for text, images, irradiance, and environment maps are
trained contrastively to align their representations, with an auxiliary spherical-harmonics
prediction task reinforcing directional understanding. Our multi-modal data pipeline enables
large-scale training and evaluation across three tasks: lighting-based retrieval,
environment-map generation, and lighting control in diffusion-based image synthesis.
Experiments show that our representation captures consistent and transferable lighting
features, enabling flexible manipulation across modalities.
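To make the training objective concrete, the following is a minimal, hypothetical sketch of contrastive alignment between two modality encoders combined with an auxiliary spherical-harmonics regression head. The encoder architectures, feature dimensions, loss weighting, and hyperparameters shown here are illustrative assumptions, not the actual UniLight implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch: two modality encoders (e.g., image and environment map)
# project into a shared embedding, trained with a symmetric InfoNCE loss,
# plus an auxiliary head that regresses spherical-harmonics coefficients.

class ModalityEncoder(nn.Module):
    def __init__(self, in_dim, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, x):
        # L2-normalize so dot products act as cosine similarities
        return F.normalize(self.net(x), dim=-1)

class SHHead(nn.Module):
    """Auxiliary head predicting 2nd-order SH coefficients (9 per RGB channel)."""
    def __init__(self, embed_dim=256, n_coeffs=9 * 3):
        super().__init__()
        self.fc = nn.Linear(embed_dim, n_coeffs)

    def forward(self, z):
        return self.fc(z)

def contrastive_loss(za, zb, temperature=0.07):
    # Symmetric InfoNCE: matching pairs on the diagonal are positives
    logits = za @ zb.t() / temperature
    targets = torch.arange(za.size(0), device=za.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy training step on random stand-in features (dimensions are assumptions)
img_enc, env_enc = ModalityEncoder(in_dim=2048), ModalityEncoder(in_dim=1024)
sh_head = SHHead()
opt = torch.optim.Adam(
    list(img_enc.parameters()) + list(env_enc.parameters()) + list(sh_head.parameters()),
    lr=1e-4,
)

img_feat = torch.randn(32, 2048)   # e.g., pre-extracted image features
env_feat = torch.randn(32, 1024)   # e.g., pre-extracted environment-map features
sh_target = torch.randn(32, 27)    # ground-truth SH coefficients for each sample

z_img, z_env = img_enc(img_feat), env_enc(env_feat)
loss = contrastive_loss(z_img, z_env) + F.mse_loss(sh_head(z_img), sh_target)
loss.backward()
opt.step()
```

In this reading, the contrastive term pulls embeddings of the same lighting condition together across modalities, while the SH regression term encourages the shared space to retain directional lighting information; how the two terms are weighted is left as an open choice here.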