RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation
Diffusion models currently achieve state-of-the-art performance for both
conditional and unconditional image generation. However, so far, image
diffusion models do not support tasks required for 3D understanding, such as
view-consistent 3D generation or single-view object reconstruction. In this
paper, we present RenderDiffusion as the first diffusion model for 3D
generation and inference that can be trained using only monocular 2D
supervision. At the heart of our method is a novel image denoising architecture
that generates and renders an intermediate three-dimensional representation of
a scene in each denoising step. This enforces a strong inductive structure within
the diffusion process, yielding a 3D-consistent representation while requiring
only 2D supervision. The resulting 3D representation can be rendered from
any viewpoint. We evaluate RenderDiffusion on the ShapeNet and CLEVR datasets and
show competitive performance for generating 3D scenes and for inferring 3D
scenes from 2D images. Additionally, our diffusion-based approach allows us to
use 2D inpainting to edit 3D scenes. We believe that our work promises to
enable full 3D generation at scale when trained on massive image collections,
thus circumventing the need for large-scale 3D model collections as
supervision.
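
Conceptually, each denoising step lifts the noisy image to an intermediate 3D scene
representation and renders it back to the image plane to obtain the denoised prediction.
The following is a minimal sketch of such a step in PyTorch; the encoder, renderer, and
class names here are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class Render3DDenoiser(nn.Module):
    # Hypothetical sketch: one denoising step that passes through an explicit 3D representation.
    def __init__(self, encoder: nn.Module, renderer):
        super().__init__()
        self.encoder = encoder    # noisy image + timestep -> 3D scene representation (assumed)
        self.renderer = renderer  # differentiable renderer: (3D representation, camera) -> image (assumed)

    def forward(self, x_t, t, camera):
        scene_3d = self.encoder(x_t, t)            # lift the noisy 2D image to an intermediate 3D scene
        x0_pred = self.renderer(scene_3d, camera)  # render it back to the input view as the denoised prediction
        return x0_pred, scene_3d                   # scene_3d can also be rendered from any other viewpoint

Because training such a step with a standard denoising-diffusion objective only compares
rendered images against 2D targets, monocular 2D supervision suffices, and rendering the
intermediate 3D representation from other cameras yields view-consistent outputs.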
Authors
Titas Anciukevičius, Zexiang Xu, Matthew Fisher, Paul Henderson, Hakan Bilen, Niloy J. Mitra, Paul Guerrero