NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors
2D-to-3D reconstruction is an ill-posed problem, yet humans are good at
solving it thanks to prior knowledge of the 3D world developed over years
of experience. Driven by this observation, we propose NeRDi, a single-view NeRF
synthesis framework with general image priors from 2D diffusion models.
Formulating single-view reconstruction as an image-conditioned 3D generation
problem, we optimize the NeRF representation by minimizing a diffusion loss on
its arbitrary-view renderings with a pretrained image diffusion model, under
the constraint of matching the input view. We leverage off-the-shelf vision-language models and
introduce a two-section language guidance as the conditioning input to the
diffusion model. This is particularly helpful for improving multiview content
coherence, as it narrows the general image prior by conditioning it on the
semantic and visual features of the single-view input image. Additionally, we
introduce a geometric loss based on estimated depth maps to regularize the
underlying 3D geometry of the NeRF. Experimental results on the DTU MVS dataset
show that our method synthesizes novel views of higher quality even
compared to existing methods trained on this dataset. We also demonstrate its
generalizability in zero-shot NeRF synthesis for in-the-wild images.
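
To make the optimization objective concrete, below is a minimal sketch of one training step combining the input-view constraint, the depth-based geometric regularizer, and the diffusion loss on a novel view. This is not the authors' implementation: `render_fn`, `eps_model`, `text_emb`, the noise schedule, the negative-Pearson depth term, and the loss weights are all assumptions standing in for the NeRF renderer, the frozen diffusion model, and the language-guidance pipeline described in the abstract.

```python
# Hedged sketch of one NeRDi-style optimization step (assumptions, not the paper's code).
import torch
import torch.nn.functional as F


def nerdi_step(render_fn,        # assumed: pose -> (rgb [3,H,W], depth [H,W]), differentiable w.r.t. NeRF params
               eps_model,        # assumed: frozen diffusion noise predictor (x_t, t, text_emb) -> eps_hat
               text_emb,         # language guidance embedding derived from the input image
               input_pose, input_rgb, input_depth,  # single input view and its estimated monocular depth
               random_pose,      # a sampled arbitrary camera pose
               num_timesteps=1000):
    # 1) Input-view constraint: the NeRF must reproduce the observed image.
    rgb_in, depth_in = render_fn(input_pose)
    loss_rgb = F.mse_loss(rgb_in, input_rgb)

    # 2) Geometric regularization: align rendered depth with the estimated depth
    #    up to scale/shift (negative Pearson correlation is one scale-invariant
    #    choice; an assumption, not necessarily the paper's exact formulation).
    d_r = depth_in.flatten() - depth_in.mean()
    d_e = input_depth.flatten() - input_depth.mean()
    loss_depth = 1.0 - (d_r * d_e).sum() / (d_r.norm() * d_e.norm() + 1e-8)

    # 3) Diffusion prior on an arbitrary view (score-distillation style): noise
    #    the rendering, denoise with the frozen, language-conditioned diffusion
    #    model, and push the rendering toward lower denoising error.
    rgb_nv, _ = render_fn(random_pose)
    x = rgb_nv.unsqueeze(0) * 2.0 - 1.0                              # map to [-1, 1]
    t = torch.randint(1, num_timesteps, (1,))
    alpha_bar = torch.cos(0.5 * torch.pi * t / num_timesteps) ** 2   # assumed schedule
    eps = torch.randn_like(x)
    x_t = alpha_bar.sqrt() * x + (1 - alpha_bar).sqrt() * eps
    with torch.no_grad():
        eps_hat = eps_model(x_t, t, text_emb)
    grad = eps_hat - eps                                             # timestep weighting omitted
    loss_diff = (grad.detach() * x).sum()                            # stop-gradient trick

    return loss_rgb + 0.1 * loss_depth + 1.0 * loss_diff             # weights are placeholders
```

The returned scalar would be backpropagated to the NeRF parameters only; the diffusion model stays frozen throughout and acts purely as a prior over the novel-view renderings.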
Authors
Congyue Deng, Chiyu "Max" Jiang, Charles R. Qi, Xinchen Yan, Yin Zhou, Leonidas Guibas, Dragomir Anguelov