BANMo: Building Animatable 3D Neural Models from Many Casual Videos
Prior work on articulated 3D shape reconstruction often relies on
specialized sensors (e.g., synchronized multi-camera systems) or pre-built 3D
deformable models (e.g., SMAL or SMPL). Such methods do not scale to
diverse sets of objects in the wild. We present BANMo, a method that requires
neither a specialized sensor nor a pre-defined template shape. BANMo builds
high-fidelity, articulated 3D models (including shape and animatable skinning
weights) from many monocular casual videos in a differentiable rendering
framework. While the use of many videos provides more coverage of camera views
and object articulations, it introduces significant challenges in establishing
correspondence across scenes with different backgrounds, illumination
conditions, etc. Our key insight is to merge three schools of thought: (1)
classic deformable shape models that make use of articulated bones and blend
skinning, (2) volumetric neural radiance fields (NeRFs) that are amenable to
gradient-based optimization, and (3) canonical embeddings that generate
correspondences between pixels and an articulated model. We introduce neural
blend skinning models that allow for differentiable and invertible articulated
deformations. When combined with canonical embeddings, such models allow us to
establish dense correspondences across videos that can be self-supervised with
cycle consistency. On real and synthetic datasets, BANMo produces
higher-fidelity 3D reconstructions of humans and animals than prior work, with the ability to
render realistic images from novel viewpoints and poses. Project webpage:
banmo-www.github.io.
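
For illustration only, below is a minimal sketch (not the authors' implementation) of the kind of differentiable blend skinning the abstract refers to: a linear blend skinning warp written with PyTorch, plus a toy cycle check that warps points forward and then back with the inverse bone transforms. The function name, tensor shapes, and the use of plain linear blend skinning are assumptions made for this sketch; BANMo's neural blend skinning additionally learns the skinning weights and is designed to be invertible.

# Minimal sketch (assumed, not the authors' code): differentiable linear
# blend skinning and a toy forward/backward cycle check in PyTorch.
import math
import torch

def blend_skinning(pts, rot, trans, weights):
    """Warp 3D points with linear blend skinning.

    pts:     (N, 3)    input 3D points
    rot:     (B, 3, 3) per-bone rotation matrices
    trans:   (B, 3)    per-bone translations
    weights: (N, B)    skinning weights; each row sums to 1
    returns: (N, 3)    warped points
    """
    # Transform every point by every bone: (N, B, 3).
    per_bone = torch.einsum('bij,nj->nbi', rot, pts) + trans[None]
    # Blend the per-bone results with the skinning weights.
    return (weights[..., None] * per_bone).sum(dim=1)

# Toy usage: warp canonical points forward, then warp back with the inverse
# bone transforms and measure the cycle error. Plain linear blend skinning is
# only approximately invertible, which is one motivation for a learned,
# invertible skinning model.
torch.manual_seed(0)
pts = torch.rand(5, 3, requires_grad=True)            # canonical points
c, s = math.cos(0.3), math.sin(0.3)
rot = torch.stack([
    torch.eye(3),                                      # bone 0: identity
    torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]),  # bone 1: z-rotation
])
trans = torch.tensor([[0.1, 0.0, 0.0], [-0.1, 0.0, 0.0]])
weights = torch.softmax(torch.rand(5, 2), dim=-1)      # rows sum to 1

deformed = blend_skinning(pts, rot, trans, weights)
rot_inv = rot.transpose(1, 2)                          # R^T
trans_inv = -torch.einsum('bij,bj->bi', rot_inv, trans)  # -R^T t
restored = blend_skinning(deformed, rot_inv, trans_inv, weights)

cycle_error = (restored - pts).pow(2).sum(dim=-1).mean()
cycle_error.backward()  # gradients flow back to the canonical points (pts.grad)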
Authors
Gengshan Yang, Minh Vo, Natalia Neverova, Deva Ramanan, Andrea Vedaldi, Hanbyul Joo