Panoptic Lifting for 3D Scene Understanding with Neural Fields
We propose a novel approach for learning panoptic 3D volumetric representations from images of in-the-wild scenes.
Unlike existing approaches, which use 3D input directly or indirectly, our method requires only machine-generated 2D panoptic segmentation masks inferred from a pre-trained network.
Our core contribution is a panoptic lifting scheme, based on a neural field representation, that generates a unified and multi-view-consistent 3D panoptic representation of the scene.
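To make the lifting scheme concrete, the sketch below shows a toy neural field with semantic and instance heads in PyTorch; the layer sizes, head structure, and class/ID counts are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PanopticField(nn.Module):
    """Toy neural field: maps a 3D point to density, color,
    semantic-class logits, and instance (surrogate-ID) logits.
    All sizes and the head layout are illustrative assumptions."""

    def __init__(self, num_classes=21, num_instances=32, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Linear(hidden, 3)
        self.semantic_head = nn.Linear(hidden, num_classes)
        self.instance_head = nn.Linear(hidden, num_instances)

    def forward(self, xyz):
        h = self.trunk(xyz)
        return {
            "density": torch.relu(self.density_head(h)),
            "color": torch.sigmoid(self.color_head(h)),
            "semantics": self.semantic_head(h),   # per-point class logits
            "instances": self.instance_head(h),   # per-point surrogate-ID logits
        }
```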
To account for inconsistencies of 2D instance identifiers across views, we solve a linear assignment problem with a cost based on the model's current predictions and the machine-generated segmentation masks, thus enabling us to lift 2D instances to 3D in a consistent way.
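As an illustration, such a per-frame matching can be solved with the Hungarian algorithm via SciPy's linear_sum_assignment; the overlap-based cost and the function name lift_frame_instances below are assumptions for the sketch, not necessarily the exact cost used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def lift_frame_instances(pred_probs, machine_mask):
    """Match one view's 2D instance IDs to persistent surrogate 3D instance IDs.

    pred_probs:   (H, W, K) rendered probabilities over K surrogate IDs.
    machine_mask: (H, W) machine-generated instance IDs (0 = background).

    Cost: negative accumulated probability of each surrogate ID inside each
    2D instance region (an illustrative choice).
    """
    frame_ids = [i for i in np.unique(machine_mask) if i != 0]
    K = pred_probs.shape[-1]
    cost = np.zeros((len(frame_ids), K))
    for r, fid in enumerate(frame_ids):
        region = (machine_mask == fid)
        # Higher accumulated probability inside the region -> lower cost.
        cost[r] = -pred_probs[region].sum(axis=0)
    rows, cols = linear_sum_assignment(cost)
    # Map each view-specific 2D instance ID to a consistent surrogate 3D ID.
    return {frame_ids[r]: int(c) for r, c in zip(rows, cols)}
```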
Experimental results validate our approach on the challenging Hypersim, Replica, and ScanNet datasets, improving by 8.4, 13.8, and 10.6% in scene-level panoptic quality (PQ) over the state of the art.
Authors
Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Buló, Norman Müller, Matthias Nießner, Angela Dai, Peter Kontschieder
Panoptic 3D scene understanding models are key to enabling applications such as panoptic view synthesis and scene editing, while maintaining robustness to diverse input data.
Single-image panoptic segmentation, unfortunately, is still insufficient for tasks requiring coherence and consistency across multiple views.
In fact, panoptic masks often contain view-specific imperfections and inconsistent classifications, and single-image 2D models naturally lack the ability to track unique object identities across views.
Our model is trained from only 2D posed images and corresponding machine-generated 2D panoptic masks, and can render color, depth, semantics, and 3D-consistent instance information for novel views of the scene.
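For intuition, per-sample field outputs can be aggregated into pixel-level color, semantic, and instance predictions with standard NeRF-style volume rendering; the sketch below is a generic compositing routine under that assumption, not necessarily the paper's exact formulation.

```python
import torch

def composite_along_ray(density, values, deltas):
    """Volume-rendering compositing of per-sample values along one ray.

    density: (S,) per-sample densities.
    values:  (S, C) per-sample quantities (color, semantic or instance probabilities).
    deltas:  (S,) spacings between consecutive samples.
    Returns the (C,) composited pixel value.
    """
    alpha = 1.0 - torch.exp(-density * deltas)                    # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0
    )                                                             # accumulated transmittance
    weights = alpha * trans                                       # per-sample render weights
    return (weights[:, None] * values).sum(dim=0)
```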
Panoptic Lifting supports applications such as panoptic novel view synthesis, scene editing, and panoptic scene understanding.
Results
We introduce Panoptic Lifting, a novel approach for lifting machine-generated 2D panoptic labels to an implicit 3D volumetric representation.
Compared to the state of the art, our model is more robust to the inherent noise in machine-generated labels, resulting in significant improvements across datasets and the ability to work on in-the-wild scenes.
As a result, our model can produce clean, coherent, and 3D-consistent panoptic segmentation masks together with color and depth images for novel views.