ZeroComp: Zero-shot Object Compositing from Image Intrinsics via Diffusion
Zitian Zhang, Frédéric Fortier-Chouinard, Mathieu Garon, Anand Bhattad, Jean-François Lalonde




[Paper]
[Supplementary]
[Poster]
[Bibtex]

Abstract

We present ZeroComp, an effective zero-shot 3D object compositing approach that does not require paired composite-scene images during training. Our method leverages ControlNet to condition on intrinsic images and combines it with a Stable Diffusion model to exploit its scene priors, together operating as an effective rendering engine. During training, ZeroComp uses intrinsic images of geometry, albedo, and masked shading, all without needing paired images of scenes with and without composited objects. Once trained, it seamlessly integrates virtual 3D objects into scenes, adjusting shading to create realistic composites. We developed a high-quality evaluation dataset and demonstrate that ZeroComp outperforms methods based on explicit lighting estimation, as well as generative techniques, on both quantitative and human perception benchmarks. Additionally, ZeroComp extends to real and outdoor image compositing, even when trained solely on synthetic indoor data, showcasing its effectiveness for image compositing.
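As a rough illustration of this conditioning scheme, the sketch below (not the released implementation; the channel layout, normalization, and the `build_conditioning` helper are assumptions) shows how per-pixel intrinsic maps could be stacked into a single ControlNet-style conditioning tensor, with shading zeroed out under the inserted object so the diffusion model must re-synthesize shading and cast shadows there.

```python
# Minimal sketch (assumed layout, not the authors' code): stack intrinsic maps
# into one conditioning tensor for a ControlNet trained on these channels.
# Shading is masked inside the object region so the model has to hallucinate
# the object's shading and its cast shadows.
import torch

def build_conditioning(depth, normals, albedo, shading, object_mask):
    """Concatenate intrinsic maps into a (C, H, W) conditioning tensor.

    depth:       (1, H, W) normalized depth of the scene with the object inserted
    normals:     (3, H, W) surface normals in [-1, 1]
    albedo:      (3, H, W) base color in [0, 1]
    shading:     (1, H, W) background shading (unknown under the object)
    object_mask: (1, H, W) 1 inside the composited object, 0 elsewhere
    """
    masked_shading = shading * (1.0 - object_mask)  # hide shading under the object
    # The exact set and order of channels is an assumption for illustration.
    return torch.cat([depth, normals, albedo, masked_shading, object_mask], dim=0)

# Example with random placeholder maps at 512x512.
H = W = 512
cond = build_conditioning(
    depth=torch.rand(1, H, W),
    normals=torch.rand(3, H, W) * 2 - 1,
    albedo=torch.rand(3, H, W),
    shading=torch.rand(1, H, W),
    object_mask=(torch.rand(1, H, W) > 0.9).float(),
)
print(cond.shape)  # torch.Size([9, 512, 512])
```

A ControlNet with a matching number of conditioning channels would then inject this tensor into the frozen Stable Diffusion backbone, which acts as the rendering engine.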

ZeroComp Pipeline


Qualitative Comparison


Extensions: Material editing

Training ZeroComp on InteriorVerse significantly enhances its performance with shiny objects by allowing precise control over roughness and metallic properties.
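As a hypothetical illustration (the helper function and default values below are not from the paper), such material editing can amount to overwriting the roughness and metallic maps inside the object's mask before conditioning:

```python
# Hypothetical sketch: when roughness/metallic maps are part of the
# conditioning (as when training on InteriorVerse), an object's appearance
# can be edited by overwriting those maps inside its mask.
import torch

def edit_material(roughness, metallic, object_mask, new_roughness=0.1, new_metallic=1.0):
    """Return roughness/metallic maps with constant values inside the object region."""
    roughness = roughness * (1 - object_mask) + new_roughness * object_mask
    metallic = metallic * (1 - object_mask) + new_metallic * object_mask
    return roughness, metallic

# Example: make the composited object behave like polished metal.
mask = (torch.rand(1, 512, 512) > 0.9).float()
rough, metal = edit_material(torch.rand(1, 512, 512), torch.zeros(1, 512, 512), mask)
```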


Extensions: Outdoor images

ZeroComp generalizes to outdoor scenes, despite being trained exclusively on indoor scenes. Note how the object shading and cast shadows seamlessly blend with the target background.


Extensions: 2D object compositing

ZeroComp can also be applied to 2D objects segmented from real images, where no 3D model is available. In this case, we rely on off-the-shelf intrinsic estimators to obtain the object's depth and normals, and use its RGB directly as albedo to avoid introducing detrimental noise into the image texture, while keeping the rest of the pipeline unchanged. For demonstration purposes, the object was segmented and placed in the target image manually. The following figure shows several such examples, illustrating that our method extends easily to 2D object compositing.
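A minimal sketch of this 2D pathway follows. It is not the authors' code: the depth model, file names, the depth-to-normals approximation, and the use of the RGB composite as a whole-frame albedo are all illustrative assumptions.

```python
# Hedged sketch of the 2D-object path: depth from an off-the-shelf monocular
# estimator, normals approximated from depth gradients as a stand-in for a
# dedicated normal estimator, and the pasted object's RGB reused as albedo.
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision.transforms.functional import to_tensor
from transformers import pipeline

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

def normals_from_depth(depth):
    """Approximate surface normals from a (1, H, W) depth map with finite differences."""
    dz_dx = F.pad(depth[:, :, 2:] - depth[:, :, :-2], (1, 1, 0, 0))
    dz_dy = F.pad(depth[:, 2:, :] - depth[:, :-2, :], (0, 0, 1, 1))
    n = torch.cat([-dz_dx, -dz_dy, torch.ones_like(depth)], dim=0)
    return n / n.norm(dim=0, keepdim=True)

composite = Image.open("composite_rgb.png").convert("RGB")    # background with the 2D cut-out pasted in
mask = to_tensor(Image.open("object_mask.png").convert("L"))  # 1 inside the pasted object

depth = to_tensor(depth_estimator(composite)["depth"])        # (1, H, W), resized to the image resolution
normals = normals_from_depth(depth)
albedo = to_tensor(composite)                                 # RGB reused as albedo (simplification)
shading = torch.ones_like(depth)                              # placeholder for a background shading estimate

# Same channel stacking as in the earlier sketch: shading is masked under the object.
cond = torch.cat([depth, normals, albedo, shading * (1 - mask), mask], dim=0)
```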


Demo Video


Acknowledgements

This research was supported by NSERC grants RGPIN 2020-04799 and ALLRP 586543-23, Mitacs, and Depix. Computing resources were provided by the Digital Research Alliance of Canada. The authors thank Louis-Étienne Messier and Justine Giroux for their help, as well as all members of the lab for discussions and proofreading.