MagicPony: Learning Articulated 3D Animals in the Wild
We consider the problem of learning a function that can estimate the 3D shape, articulation, viewpoint, texture, and lighting of an articulated animal, such as a horse, given a single test image.
We present a new method, dubbed MagicPony, that learns this function purely from in-the-wild single-view images of the object category, with minimal assumptions about the topology of deformation.
At its core is an implicit-explicit representation of articulated shape and appearance, combining the strengths of neural fields and meshes.
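To make this combination concrete, the sketch below shows the general pattern: a neural field stores the shape implicitly as a signed distance function, and an explicit mesh is extracted from it for rendering. This is an illustration, not the paper's implementation; the sphere-plus-residual parameterisation, network sizes, and the use of skimage's marching cubes in place of a differentiable marching-tetrahedra extractor are all assumptions.

```python
import torch
import torch.nn as nn
from skimage.measure import marching_cubes  # stand-in for a differentiable extractor

class SDFField(nn.Module):
    """Implicit half: an MLP mapping 3D points to signed distances,
    parameterised as a small residual around a unit sphere so that a
    zero level set is guaranteed to exist at initialisation."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
            nn.Linear(hidden, 1),
        )

    def forward(self, pts):                        # pts: (N, 3)
        sphere = pts.norm(dim=-1) - 0.5            # SDF of a radius-0.5 sphere
        return sphere + 0.1 * self.net(pts).squeeze(-1)

def extract_mesh(field, res=64, bound=1.0):
    """Explicit half: sample the field on a grid and pull out a triangle
    mesh, which standard mesh rasterisers can then render efficiently."""
    lin = torch.linspace(-bound, bound, res)
    grid = torch.stack(torch.meshgrid(lin, lin, lin, indexing="ij"), dim=-1)
    with torch.no_grad():
        sdf = field(grid.reshape(-1, 3)).reshape(res, res, res)
    verts, faces, _, _ = marching_cubes(sdf.numpy(), level=0.0)
    return verts, faces

verts, faces = extract_mesh(SDFField())
```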
To help the model understand an object's shape and pose, we distil the knowledge captured by an off-the-shelf self-supervised vision transformer and fuse it into the 3D model.
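A minimal sketch of what such distillation can look like in practice follows; the feature-rendering branch (`rendered_feats`) is hypothetical, and we load a DINO ViT-S/8 via torch.hub as a convenient stand-in for whichever frozen self-supervised transformer is used.

```python
import torch
import torch.nn.functional as F

# Frozen self-supervised ViT whose patch features act as the teacher signal.
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits8")
dino.eval()

@torch.no_grad()
def teacher_features(images):                    # images: (B, 3, 224, 224)
    tokens = dino.get_intermediate_layers(images, n=1)[0]  # (B, 1 + 28*28, 384)
    patches = tokens[:, 1:]                      # drop the CLS token
    return patches.permute(0, 2, 1).reshape(-1, 384, 28, 28)

def distillation_loss(rendered_feats, images):
    """rendered_feats: a (B, 384, 28, 28) feature image rendered from the 3D
    model by a hypothetical feature branch; regressing it onto the frozen
    transformer's features fuses 2D correspondence knowledge into 3D."""
    return F.mse_loss(rendered_feats, teacher_features(images))
```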
To overcome common local optima in viewpoint estimation, we further introduce a new viewpoint sampling scheme that comes at no added training cost.
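One way such a scheme can work is sketched below (a hedged sketch; the tensor shapes, number of hypotheses, and 6-D pose parameterisation are assumptions): the network proposes several candidate viewpoints with scores, only a sampled one is rendered per iteration (hence no extra render cost), and the score head is trained to predict each hypothesis' reconstruction loss, so sampling gradually concentrates on good viewpoints.

```python
import torch
import torch.nn.functional as F

def sample_viewpoint(hypotheses, scores):
    """hypotheses: (B, K, 6) candidate camera poses; scores: (B, K) predicted
    reconstruction losses. Only the sampled hypothesis gets rendered, so the
    per-iteration cost stays that of a single viewpoint."""
    probs = torch.softmax(-scores, dim=-1)         # lower predicted loss -> higher prob
    idx = torch.multinomial(probs, num_samples=1)  # (B, 1)
    gather_idx = idx[..., None].expand(-1, -1, hypotheses.shape[-1])
    return hypotheses.gather(1, gather_idx).squeeze(1), idx.squeeze(1)

def score_loss(scores, idx, recon_loss):
    """Regress the sampled hypothesis' score onto the loss actually observed,
    so that bad viewpoints are proposed less confidently over time."""
    picked = scores.gather(1, idx[:, None]).squeeze(1)  # (B,)
    return F.mse_loss(picked, recon_loss.detach())

# Usage with dummy predictions: one pose is chosen per image in the batch.
pose, idx = sample_viewpoint(torch.randn(4, 4, 6), torch.rand(4, 4))
```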
Compared to prior works, we show significant quantitative and qualitative improvements on this challenging task.
The model also demonstrates excellent generalisation in reconstructing abstract drawings and artefacts, despite being trained only on real images.
Authors
Shangzhe Wu, Ruining Li, Tomas Jakab, Christian Rupprecht, Andrea Vedaldi
We propose a novel approach to learning 3D models of articulated object categories, such as horses and birds, with only single-view images for training, which we dub MagicPony.
For training, we only require a 2D segmenter for the objects and a description of the topology and symmetry of the 3D skeleton (e.g., the number of legs).
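To give a sense of how lightweight this prior is, a hypothetical category description might look like the following; all field names and values are illustrative, not the paper's actual configuration format.

```python
from dataclasses import dataclass

@dataclass
class SkeletonSpec:
    """Per-category prior: only the topology and symmetry of the skeleton,
    never poses, shapes, or keypoint annotations."""
    n_legs: int = 4
    n_body_bones: int = 8                          # chain along the spine
    n_leg_bones: int = 3                           # bones per leg
    symmetric_leg_pairs: tuple = ((0, 1), (2, 3))  # left/right leg indices

horse = SkeletonSpec()                             # quadrupeds share one spec
bird = SkeletonSpec(n_legs=2, symmetric_leg_pairs=((0, 1),))
```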
We leverage recent progress in unsupervised representation learning, unsupervised image matching, efficient implicit-explicit shape representations, and neural rendering to devise a new auto-encoder architecture that reconstructs the 3D shape, articulation, and texture of each object instance from a single test image.
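The overall architecture can be pictured with the following schematic forward pass. This is a sketch only: the module names, dimensions, and the pose and lighting parameterisations are all assumptions, and the differentiable renderer that closes the reconstruction loop is omitted.

```python
import torch
import torch.nn as nn

class ArticulatedAutoencoder(nn.Module):
    """A shared image encoder feeds separate heads for the instance-specific
    factors; a differentiable renderer (omitted) recombines them with the
    category-level shape into an image, and photometric and mask
    reconstruction errors supervise the whole pipeline."""
    def __init__(self, feat_dim=256, n_bones=20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.articulation = nn.Linear(feat_dim, n_bones * 3)  # per-bone rotations
        self.texture_code = nn.Linear(feat_dim, feat_dim)     # conditions an albedo field
        self.viewpoint = nn.Linear(feat_dim, 6)               # camera rotation + translation
        self.lighting = nn.Linear(feat_dim, 4)                # ambient, diffuse, direction

    def forward(self, image):                                 # image: (B, 3, H, W)
        z = self.encoder(image)
        return {
            "articulation": self.articulation(z),
            "texture_code": self.texture_code(z),
            "viewpoint": self.viewpoint(z),
            "lighting": self.lighting(z),
        }

out = ArticulatedAutoencoder()(torch.randn(1, 3, 128, 128))
```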
From this, we learn a function that, at test time, can estimate the shape and texture of a new object from a single image, in a feed-forward manner.
The function exhibits remarkable generalisation properties, capable of reconstructing objects in abstract drawings, despite being trained on real images only.
Results
We introduce a new method that learns a 3D model of an articulated object category from single-view images taken in the wild.
This model can, at test time, reconstruct the shape, articulation, albedo, and lighting of the object from a single image, and generalises to abstract drawings.
Our approach demonstrates the power of combining several recent advances in self-supervised representation learning with a new viewpoint sampling scheme.