Multi-view Bootstrapping in the Wild: A Multi-Camera Approach to Labeling Landmarks in Unconstrained Video Sequences
MBW: Multi-view Bootstrapping in the Wild
Labeling articulated objects in unconstrained settings has a wide variety of applications including entertainment, neuroscience, psychology, ethology, and many fields of medicine.
Large offline labeled datasets do not exist for all but the most common articulated object categories (e.g., humans).
Hand labeling these landmarks within a video sequence is a laborious task. Learned landmark detectors can help, but can be error-prone when trained from only a few examples.
Multi-camera systems that train fine-grained detectors have shown significant promise in detecting such errors, allowing for self-supervised solutions that only need a small percentage of the video sequence to be hand-labeled.
The approach, however, is based on calibrated cameras and rigid geometry, making it expensive, difficult to manage, and impractical in real-world scenarios.
In this paper, we address these bottlenecks by combining a non-rigid 3D neural prior with deep flow to obtain high-fidelity landmark estimates from videos with only two or three uncalibrated, handheld cameras.
With just a few annotations (representing 1-2% of the frames), we are able to produce 2D results comparable to state-of-the-art fully supervised methods, along with 3D reconstructions that are impossible with other existing approaches.
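To make the bootstrapping idea concrete, the sketch below walks through one round of the loop on synthetic data: propagate a handful of hand labels to neighboring frames, score cross-view consistency, and keep only the consistent frames as pseudo-labels for retraining a detector. The function names and the consistency proxy are illustrative assumptions for this sketch, not the authors' implementation; MBW's actual pipeline uses a learned deep-flow model and a trained non-rigid 3D neural prior in place of these placeholders.

```python
"""Minimal, illustrative sketch of a multi-view bootstrapping round.

Steps (mirroring the abstract, with simplified stand-ins):
  1. Hand-label a small fraction of frames in each uncalibrated view.
  2. Propagate labels to the remaining frames (stand-in for deep optical flow).
  3. Score cross-view agreement (stand-in for the non-rigid 3D neural prior's
     reprojection check) and keep only consistent frames as pseudo-labels.
  4. The kept frames would then be used to retrain a 2D landmark detector.
All data below is synthetic; this is not the MBW codebase.
"""
import numpy as np

rng = np.random.default_rng(0)
NUM_VIEWS, NUM_FRAMES, NUM_LANDMARKS = 2, 100, 15
LABELED = np.arange(0, NUM_FRAMES, 50)  # ~2% of frames hand-labeled

# Synthetic per-view 2D landmark tracks, used only to fake propagation.
tracks = rng.normal(size=(NUM_VIEWS, NUM_FRAMES, NUM_LANDMARKS, 2))


def propagate_labels(view_tracks: np.ndarray) -> np.ndarray:
    """Stand-in for flow-based propagation: perturb tracks with small noise."""
    return view_tracks + rng.normal(scale=0.05, size=view_tracks.shape)


def multiview_consistency(all_views: np.ndarray) -> np.ndarray:
    """Stand-in for the 3D neural prior: per-frame cross-view disagreement."""
    spread = all_views - all_views.mean(axis=0, keepdims=True)
    return np.linalg.norm(spread, axis=-1).mean(axis=(0, 2))


# One bootstrapping round: propagate, check consistency, select pseudo-labels.
pseudo = np.stack([propagate_labels(tracks[v]) for v in range(NUM_VIEWS)])
error = multiview_consistency(pseudo)            # per-frame consistency score
keep = error < np.quantile(error, 0.8)           # keep most consistent frames
keep[LABELED] = True                             # hand labels are always kept

print(f"Accepted {keep.sum()}/{NUM_FRAMES} frames as pseudo-labels for retraining.")
```

In the actual method, the frames that fail the consistency check are dropped rather than hand-corrected, which is what keeps the required annotation budget at a few percent of the sequence.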
Authors
Mosam Dabhi, Chaoyang Wang, Tim Clifford, Laszlo Attila Jeni, Ian R. Fasel, Simon Lucey