We present Depth-aware Image-based NEural Radiance fields (DINER). Given a
sparse set of RGB input views, we predict depth and feature maps to guide the
reconstruction of a volumetric scene representation that allows us to render 3D
objects under novel views. Specifically, we propose novel techniques to
incorporate depth information into feature fusion and efficient scene sampling.
In comparison to the previous state of the art, DINER achieves higher synthesis
quality and can process input views with greater disparity. This allows us to
capture scenes more completely without changing the capture hardware requirements
and ultimately enables larger viewpoint changes during novel view synthesis. We
evaluate our method by synthesizing novel views, both for human heads and for
general objects, and observe significantly improved qualitative results and
improved perceptual metrics compared to the previous state of the art. The
code will be made publicly available for research purposes.
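
To make the idea of depth-guided scene sampling concrete, the following is a minimal sketch and not the DINER implementation. It assumes a single ray with a predicted depth value and a hypothetical uncertainty `sigma`, and it places sample distances from a Gaussian around that depth instead of spacing them evenly between the near and far bounds. The function names, the Gaussian choice, and all numeric values are illustrative assumptions.

```python
import random


def depth_guided_samples(near, far, depth, sigma, n_samples, seed=0):
    """Draw sorted sample distances along one ray, concentrated near `depth`.

    Hypothetical helper: a Gaussian around the predicted depth, clamped to
    [near, far]. This illustrates the general idea only; it is an assumption,
    not the sampling scheme used by DINER.
    """
    if not near < far:
        raise ValueError("near must be smaller than far")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    samples = []
    for _ in range(n_samples):
        t = rng.gauss(depth, sigma)
        samples.append(min(max(t, near), far))  # clamp to the ray segment
    return sorted(samples)


def uniform_samples(near, far, n_samples):
    """Baseline: evenly spaced bin midpoints over the ray segment."""
    step = (far - near) / n_samples
    return [near + (i + 0.5) * step for i in range(n_samples)]


def count_near(samples, depth, radius):
    """Count samples that fall within `radius` of the predicted depth."""
    return sum(1 for t in samples if abs(t - depth) <= radius)


# Illustrative values only (assumed, not taken from the paper).
guided = depth_guided_samples(near=0.5, far=3.0, depth=1.2, sigma=0.05, n_samples=16)
uniform = uniform_samples(near=0.5, far=3.0, n_samples=16)
```

With the same sample budget, the guided variant spends most of its samples close to the predicted surface, whereas the uniform baseline spreads them over the full ray, including empty space.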