We present additional results complementing the main paper. In particular, we show interactive relighting results,
using pre-computed images generated by our method and baselines.
Hover your mouse over the images to see animated relighting results!
Here, we further detail how we obtain the latent shadow mask \(\mathbf{m}_{\text{shw},\downarrow}\) from the full-resolution guiding shadow \(\mathbf{m}_{\text{shw}}\). We first downsample the shadow \(\mathbf{m}_{\text{shw}}\) with simple bilinear interpolation. We then binarize the downsampled shadow using a threshold of 0.05, so that softer parts of the shadow are included in the mask. We dilate the mask with a \(3\times 3\) kernel to capture details at the shadow edge, and multiply the mask value at the edge by two. Finally, we remove from the shadow mask its intersection with the downsampled object mask, to prevent the shadow from leaking into the object shading in latent space.
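For reference, this procedure can be sketched as follows (a minimal PyTorch sketch with names of our choosing, assuming a latent downsampling factor of 8; it illustrates the steps rather than reproducing our exact implementation):

```python
import torch
import torch.nn.functional as F

def latent_shadow_mask(m_shw, m_obj, factor=8, thresh=0.05):
    """Sketch of the latent shadow mask computation.

    m_shw: (1, 1, H, W) full-resolution guiding shadow in [0, 1].
    m_obj: (1, 1, H, W) full-resolution object mask in [0, 1].
    factor: assumed spatial downsampling factor of the latent space.
    """
    # 1. Downsample the shadow and object mask to the latent resolution
    #    with simple bilinear interpolation.
    size = (m_shw.shape[-2] // factor, m_shw.shape[-1] // factor)
    shw = F.interpolate(m_shw, size=size, mode="bilinear", align_corners=False)
    obj = F.interpolate(m_obj, size=size, mode="bilinear", align_corners=False)

    # 2. Binarize with a low threshold (0.05) to keep the softer shadow parts.
    mask = (shw > thresh).float()

    # 3. Dilate with a 3x3 kernel to include details at the shadow edge,
    #    and give the newly added edge pixels twice the weight (our reading
    #    of "multiply the mask value at the edge by two").
    dilated = F.max_pool2d(mask, kernel_size=3, stride=1, padding=1)
    edge = dilated - mask
    mask = mask + 2.0 * edge

    # 4. Remove the intersection with the object mask so the shadow does not
    #    leak into the object shading in latent space.
    return mask * (1.0 - (obj > 0.5).float())
```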
Many forms of negative shadow conditioning can be used to enhance lighting control over the object. In our case, when the shadow is obtained from shadow mapping, we cast the negative shadow with the light shifted by 180° in azimuth. For hand-drawn shadows, we use a conditioning where no shadow is drawn for the negative branch.
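As an illustration, the negative conditioning could be constructed as follows (a sketch only; `render_shadow_map` is a hypothetical helper that casts a shadow map for a given light direction):

```python
import numpy as np

def negative_shadow_conditioning(source, azimuth_deg, elevation_deg,
                                 render_shadow_map, image_shape):
    """Sketch of the negative shadow branch (hypothetical helper names)."""
    if source == "shadow_mapping":
        # Cast the negative shadow with the light shifted by 180 deg in azimuth.
        return render_shadow_map((azimuth_deg + 180.0) % 360.0, elevation_deg)
    if source == "hand_drawn":
        # For hand-drawn shadows, the negative branch simply contains no shadow.
        return np.zeros(image_shape, dtype=np.float32)
    raise ValueError(f"unknown shadow source: {source}")
```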
We employ the checkpoint of their X->RGB model finetuned for inpainting rectangular masks, and mask out the bounding box containing both the object and the shadow region. To estimate the intrinsic properties of test images (normals, metallic, roughness, and albedo), we use the RGB->X model. We observe that results are generally overly bright, likely due to a domain gap between the training and evaluation images. To address this, we re-expose the background by a factor of 2 before feeding it to the network, then divide the output by 2, both in linear space. We apply the same background preservation strategy as in ZeroComp.
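The re-exposure amounts to a simple scaling in linear space, sketched below (assuming gamma-encoded inputs, a simple gamma of 2.2 for the conversion, and a hypothetical `run_rgbx_inpainting` wrapper around the RGB-X inpainting model):

```python
import numpy as np

GAMMA = 2.2  # simple gamma approximation of the sRGB transfer function

def re_exposed_inference(background, run_rgbx_inpainting, factor=2.0):
    """Sketch of the re-exposure workaround (hypothetical wrapper name)."""
    # Re-expose the background by `factor` in linear space before inference.
    boosted = np.clip(background, 0.0, 1.0) ** GAMMA * factor
    output = run_rgbx_inpainting(np.clip(boosted, 0.0, 1.0) ** (1.0 / GAMMA))
    # Divide the output by the same factor, again in linear space.
    corrected = np.clip(output, 0.0, 1.0) ** GAMMA / factor
    return corrected ** (1.0 / GAMMA)
```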
We show quantitative and qualitative results of the RGB-X backbone in sections 4 and 5 of this supplementary material.
Both DiLightNet and Neural Gaffer require an environment map for lighting. We construct such an environment map with radiance \(L_{\text{env}}\) along direction \(\omega\) defined as the sum of a spherical Gaussian and a constant term \begin{equation} L_{\text{env}}(\omega) = \mathbf{c}_\text{light} e^{\lambda (\omega \cdot \mathbf{v} - 1)} + \mathbf{c}_\text{amb} \,, \end{equation} where \(\mathbf{c}_\text{light}\) and \(\mathbf{c}_\text{amb}\) are the RGB colors of the light and ambient terms, respectively, \(\mathbf{v}\) is the dominant light source direction, and \(\lambda\) is the bandwidth. \(\mathbf{c}_\text{amb}\) is obtained by averaging the color of the background image. Panoramas from the Laval Indoor HDR dataset (excluding those in our test set) are used to estimate a single average intensity of the dominant light source, \(k\), defined as the ratio of the integral over the brightest pixels of the panoramas to the integral over all pixels. We further divide this value by 2 to account for a single hemisphere and to avoid overly bright dominant light sources. The light color is then \(\mathbf{c}_\text{light} = k \mathbf{c}_\text{amb}'\), where \(\mathbf{c}_\text{amb}'\) is the normalized ambient color \(\mathbf{c}_\text{amb}\). We found that the bandwidth parameter \(\lambda\) did not favor any specific method and therefore fixed it to \(\lambda = 300\), a value we typically observe for indoor light sources.
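For reference, a minimal sketch of such an environment map in lat-long (equirectangular) format, following the equation above (numpy; variable names and the coordinate convention are ours):

```python
import numpy as np

def make_env_map(c_light, c_amb, v, lam=300.0, height=128, width=256):
    """Lat-long environment map: spherical Gaussian plus constant ambient term.

    c_light, c_amb: RGB colors of the light and ambient terms.
    v: unit vector giving the dominant light source direction.
    lam: spherical Gaussian bandwidth (lambda).
    """
    # Per-pixel directions omega on the unit sphere (z-up convention).
    theta = (np.arange(height) + 0.5) / height * np.pi        # polar angle
    phi = (np.arange(width) + 0.5) / width * 2.0 * np.pi      # azimuth
    theta, phi = np.meshgrid(theta, phi, indexing="ij")
    omega = np.stack([np.sin(theta) * np.cos(phi),
                      np.sin(theta) * np.sin(phi),
                      np.cos(theta)], axis=-1)                # (H, W, 3)

    # L_env(omega) = c_light * exp(lambda * (omega . v - 1)) + c_amb
    sg = np.exp(lam * (omega @ np.asarray(v) - 1.0))[..., None]
    return sg * np.asarray(c_light) + np.asarray(c_amb)
```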
We extend the quantitative results from tab. 1. First, we compute metrics separately for the foreground (evaluating the fidelity of the object's shading against the ground truth) and the background (evaluating the fidelity of the background shadows against the ground truth). Second, we report additional results, including SpotLight applied to the RGB-X backbone.
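The foreground/background split simply restricts the pixel-wise metrics to the corresponding mask, e.g. (a sketch, not our exact evaluation code):

```python
import numpy as np

def masked_rmse(pred, gt, mask):
    # RMSE over the pixels selected by the mask (e.g. the object mask for
    # "foreground only", its complement for "background only").
    m = mask > 0.5
    return float(np.sqrt(np.mean((pred[m] - gt[m]) ** 2)))

def masked_mae(pred, gt, mask):
    m = mask > 0.5
    return float(np.mean(np.abs(pred[m] - gt[m])))
```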
Method | Full image: PSNR | Full image: SSIM | Full image: RMSE | Full image: MAE | Full image: LPIPS | Foreground: RMSE | Foreground: MAE | Foreground: SI-RMSE | Background: RMSE | Background: MAE |
---|---|---|---|---|---|---|---|---|---|---|
DiLightNet | 24.67 | 0.948 | 0.064 | 0.022 | 0.042 | 0.207 | 0.192 | 0.055 | 0.03 | 0.009 |
Neural Gaffer | 28.44 | 0.963 | 0.042 | 0.015 | 0.038 | 0.102 | 0.088 | 0.049 | 0.03 | 0.009 |
IC-Light | 26.87 | 0.959 | 0.054 | 0.019 | 0.04 | 0.153 | 0.13 | 0.062 | 0.03 | 0.009 |
ZeroComp+SDEdit | 26.0 | 0.938 | 0.053 | 0.025 | 0.079 | 0.079 | 0.064 | 0.048 | 0.048 | 0.022 |
SpotLight (no guidance) | 31.69 | 0.976 | 0.029 | 0.011 | 0.029 | 0.086 | 0.073 | 0.046 | 0.017 | 0.006 |
SpotLight (with guidance, ours) | 30.68 | 0.974 | 0.033 | 0.012 | 0.03 | 0.1 | 0.085 | 0.05 | 0.018 | 0.006 |
SpotLight (RGB-X backbone) | 26.56 | 0.955 | 0.051 | 0.018 | 0.042 | 0.176 | 0.155 | 0.06 | 0.02 | 0.007 |
We also extend the results from tab. 2, showing results for the full image, foreground only, and background only, for the guidance scale γ and the latent mask weight β in the following two tables. We observe that some parameter changes, such as disabling blending (β=0) or disabling guidance (γ=1), may yield better quantitative results. However, our qualitative evaluation and user studies show that these changes diminish the level of lighting control over the object. Our selected parameter combination provides good quantitative performance and adequate lighting control.
Method | Full image: PSNR | Full image: SSIM | Full image: RMSE | Full image: MAE | Full image: LPIPS | Foreground: RMSE | Foreground: MAE | Foreground: SI-RMSE | Background: RMSE | Background: MAE |
---|---|---|---|---|---|---|---|---|---|---|
SpotLight (γ=1, no guidance) | 31.69 | 0.976 | 0.029 | 0.011 | 0.029 | 0.086 | 0.073 | 0.046 | 0.017 | 0.006 |
SpotLight (γ=3, with guidance, ours) | 30.68 | 0.974 | 0.033 | 0.012 | 0.03 | 0.1 | 0.085 | 0.05 | 0.018 | 0.006 |
SpotLight (γ=7) | 28.68 | 0.966 | 0.043 | 0.015 | 0.036 | 0.138 | 0.116 | 0.062 | 0.019 | 0.006 |
Method | Full image: PSNR | Full image: SSIM | Full image: RMSE | Full image: MAE | Full image: LPIPS | Foreground: RMSE | Foreground: MAE | Foreground: SI-RMSE | Background: RMSE | Background: MAE |
---|---|---|---|---|---|---|---|---|---|---|
SpotLight (β=0.2) | 29.24 | 0.969 | 0.039 | 0.014 | 0.034 | 0.102 | 0.086 | 0.051 | 0.025 | 0.008 |
SpotLight (β=0.05, ours) | 30.68 | 0.974 | 0.033 | 0.012 | 0.03 | 0.1 | 0.085 | 0.05 | 0.018 | 0.006 |
SpotLight (β=0, no blending) | 30.81 | 0.974 | 0.032 | 0.012 | 0.029 | 0.098 | 0.083 | 0.05 | 0.018 | 0.006 |
We present 20 additional randomly selected results on the "user-controlled" dataset, extending the
qualitative results from fig. 5.
In this case, we rendered 8 light directions instead of 5 (as used for the user study),
to also show results where the shadow falls behind the object.
Move your mouse from left to right over the images to see the light direction change.
In addition to the overall realism and lighting control user studies, we conducted two user studies to disentangle shadow realism from shading realism. To evaluate shading realism, the shadow must be identical across methods: for every method, we replace the region outside the object mask by the output of SpotLight, which contains the refined shadow. To evaluate shadow realism, we do the opposite: we replace the region within the object mask by the output of SpotLight, which contains the shaded object. We show the results of these two user studies below.
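Concretely, the compositing used in these two studies is a simple mask-based replacement, sketched below with hypothetical variable names:

```python
import numpy as np

def fix_shadow(method_output, spotlight_output, object_mask):
    # Shading realism study: keep each method's object shading, but replace the
    # region outside the object mask by SpotLight's output (refined shadow).
    m = object_mask[..., None] if object_mask.ndim == 2 else object_mask
    return m * method_output + (1.0 - m) * spotlight_output

def fix_shading(method_output, spotlight_output, object_mask):
    # Shadow realism study: the opposite; the object region comes from SpotLight.
    m = object_mask[..., None] if object_mask.ndim == 2 else object_mask
    return (1.0 - m) * method_output + m * spotlight_output
```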
In addition to the lighting control user study in fig. 4b, which compares our method to ZeroComp+SDEdit, we show here that disabling the guidance term yields a similar degradation in perceptual scores for the lighting control setting.
To use the Thurstone Case V law of comparative judgment, we need a fixed set of comparisons shown to all observers. This set of comparisons is randomly sampled once and reused for all users. To limit bias, the left/right ordering of the methods is randomized, and the order of comparisons is randomized for each observer. All user studies were conducted with different sets of observers to avoid bias.
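For completeness, here is a minimal sketch of Thurstone Case V scaling from a pairwise win-count matrix (the clipping of extreme proportions is a common practical choice, shown here as an assumption rather than our exact procedure):

```python
import numpy as np
from scipy.stats import norm

def thurstone_case_v(wins, eps=0.01):
    """Thurstone Case V scale values from a pairwise win-count matrix.

    wins[i, j] = number of observers preferring method i over method j.
    Returns one relative scale value per method (higher = preferred).
    """
    wins = np.asarray(wins, dtype=np.float64)
    totals = wins + wins.T
    with np.errstate(invalid="ignore", divide="ignore"):
        p = wins / totals                      # preference proportions
    p = np.where(totals > 0, p, 0.5)           # missing pairs -> indifference
    p = np.clip(p, eps, 1.0 - eps)             # avoid infinite z-scores
    z = norm.ppf(p)                            # probit transform
    np.fill_diagonal(z, 0.0)
    return z.mean(axis=1)                      # Case V: average z per method
```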
For each study, three sentinel images were randomly placed among the questions. Incorrectly answering one of the sentinel questions led to exclusion. In the realism user studies, the sentinels showed the object set to pure white, resulting in highly unrealistic shading. In the lighting control user studies, the sentinels showed an object whose lighting does not change as the shadow moves.
Here, we show the details of each user study. "Filtered observers" is the number of observers contributing to the scores, i.e., those who answered all sentinel questions correctly. Note that the number of questions shown includes the three sentinel questions asked to the users.
User study name | Original observers | Filtered observers (N) | Questions per method pair | Total questions |
---|---|---|---|---|
Overall realism | 40 | 35 | 20 | 123 |
Shadow realism | 14 | 13 | 40 | 123 |
Shading realism | 14 | 11 | 40 | 123 |
Lighting control (vs. ZeroComp+SDEdit) | 10 | 8 | 40 | 43 |
Lighting control (vs. no guidance) | 10 | 7 | 40 | 43 |
Below, we show the instructions page of the overall realism user study and a question from it.
And here, we show a video demonstration of the lighting control user study.
We show how the two main parameters of our method can be adjusted for enhanced artistic control, extending sec. 4.2 and sec. 4.3. Here, both parameters are modified separately, but they could be modified simultaneously for optimal results.
Here, we show control over the guidance scale γ, extending fig. 6.
Increasing the guidance scale generally makes the dominant light source stronger.
Move the mouse vertically to adjust the guidance scale, and horizontally to adjust the light
azimuth
angle.
The current light azimuth angle is 0°.
The current guidance scale is γ=3.0.
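As a rough illustration of how γ acts, assuming a classifier-free-guidance-style combination of the denoiser predictions under the positive and negative shadow conditioning (a sketch only; see the main paper for the exact formulation):

```python
def guided_prediction(v_t_pos, v_t_neg, gamma=3.0):
    # Illustrative only: gamma = 1 uses the positive (shadow) branch alone,
    # while larger gamma pushes the prediction away from the negative shadow
    # conditioning, strengthening the dominant light source.
    return v_t_neg + gamma * (v_t_pos - v_t_neg)
```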
Here, we show control over the latent mask weight β, extending fig. 7.
Increasing the latent mask weight makes the dominant shadow darker.
Move the mouse vertically to adjust the latent mask weight, and horizontally to adjust the light
azimuth angle.
The current light azimuth angle is 0°.
The current latent mask weight is β=0.05.
We adjust the light source radius used to generate the coarse shadow fed to SpotLight.
Small radii lead to hard shadows, whereas larger radii generate softer, more diffuse shadows.
Move the mouse vertically to adjust the light source radius, and horizontally to adjust the light
azimuth angle.
The current light azimuth angle is 0°.
The current light radius is 1 (default).
In our paper, we analyze all the methods by conditioning on a single dominant light source, which is sufficiently
realistic in most cases.
Here, we show that we can combine outputs from SpotLight at different light directions to simulate multiple light
sources.
We combine a static light direction (shadow to the right of the object) with a dynamic direction (hover over the
images to move this virtual light). We combine the two lightings
in linear space (assuming a gamma of 2.2) using the following equation:
$$x_{\text{combined}} = \left(0.5 \, x_{\text{light 1}}^{2.2} + 0.5 \, x_{\text{light 2}}^{2.2}\right)^{\frac{1}{2.2}}.$$
Notice how the static shadow and shading have a clear effect on the combined output.
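In code, this blend is straightforward (assuming images normalized to [0, 1]):

```python
import numpy as np

def combine_lightings(x1, x2, gamma=2.2):
    # Average the two relit images in linear space, then re-apply the gamma.
    lin = 0.5 * np.clip(x1, 0, 1) ** gamma + 0.5 * np.clip(x2, 0, 1) ** gamma
    return lin ** (1.0 / gamma)
```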
In our work, we use SpotLight for object relighting. We experiment with extending SpotLight for full-scene relighting using the following approach.
In step 1, we use the same ZeroComp checkpoint as in all other experiments. For steps 2 and 3, which require properly relighting the background, we found that this checkpoint has limited full-scene relighting capabilities. We hypothesize that this is because the neural renderer backbone is trained to relight only a small region of an image (circular and rectangular masks at training time). We therefore train a separate model from scratch for 270K iterations on inverted shading masks, where we take the inverse of the circular and rectangular masks as the shading mask and keep only its largest connected component. This yields training examples where only a small region of the shading is known and the full background lighting must be inferred.
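A minimal sketch of this inverted-mask construction (scipy; function and variable names are ours):

```python
import numpy as np
from scipy import ndimage

def inverted_shading_mask(mask):
    """Invert a circular/rectangular shading mask and keep only the largest
    connected component of the inverted mask.

    mask: (H, W) binary array, 1 inside the original circle/rectangle.
    """
    inverted = 1 - mask.astype(np.uint8)
    labels, num = ndimage.label(inverted)
    if num == 0:
        return inverted
    # Keep the largest connected component of the inverted mask.
    sizes = ndimage.sum(inverted, labels, index=np.arange(1, num + 1))
    largest = 1 + int(np.argmax(sizes))
    return (labels == largest).astype(np.uint8)
```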
We demonstrate that SpotLight can be applied to 2D images by leveraging a ZeroComp backbone trained without depth maps, using the normals (Stable Normal) and albedo (IID) estimators. Without access to a 3D model, we cannot rely on a rendering engine to cast shadows. Instead, we employ PixHt-Lab to generate realistic soft shadows from a controllable point light position. Here, a sample without shadow is provided as the negative guidance \(\mathbf{v}_{t, \text{neg}}\).