$S^3$Net: Semantic-Aware Self-supervised Depth Estimation with Monocular Videos and Synthetic Data
Depth estimation with monocular cameras enables the widespread use of cameras as low-cost depth sensors in applications such as autonomous driving and robotics.
However, training such a scalable depth estimation model requires large amounts of labeled data, which are expensive to collect.
We present a self-supervised framework that combines two complementary sources of supervision: we use synthetic and real-world images for training, while exploiting geometric, temporal, and semantic constraints across space and time in monocular video frames.
We present a unique way to train this self-supervised framework and achieve substantial improvements over both (i) previous synthetic supervised approaches that use domain adaptation and (ii) previous self-supervised approaches that exploit geometric constraints from real data.
Our novel consolidated architecture establishes a new state of the art in self-supervised depth estimation from monocular videos.
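To make the combination of supervision signals concrete, the following is a minimal PyTorch sketch (our illustration, not the authors' released code) of the two complementary losses the abstract alludes to: a photometric reprojection loss on real monocular video frames, which realizes the geometric constraint, and a supervised depth loss on synthetic frames with ground-truth depth. The function names (`warp_to_target`, `total_loss`), the relative pose `T_src_tgt` (in practice predicted by a pose network), and the loss weights are assumptions; the paper's full objective also includes temporal and semantic consistency terms omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift each target pixel to a 3-D point using the predicted depth."""
    b, _, h, w = depth.shape
    v, u = torch.meshgrid(
        torch.arange(h, dtype=depth.dtype, device=depth.device),
        torch.arange(w, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(3, -1)  # (3, H*W)
    rays = K_inv @ pix                                                   # camera rays
    return rays.unsqueeze(0) * depth.reshape(b, 1, -1)                   # (B, 3, H*W)

def warp_to_target(src_img, depth, T_src_tgt, K, K_inv):
    """Synthesize the target view from a source frame via depth and relative pose."""
    b, _, h, w = src_img.shape
    pts = backproject(depth, K_inv)                           # points in target camera
    pts = T_src_tgt[:, :3, :3] @ pts + T_src_tgt[:, :3, 3:]   # move to source camera
    pix = K @ pts
    pix = pix[:, :2] / pix[:, 2:].clamp(min=1e-6)             # perspective divide
    gx = 2.0 * pix[:, 0] / (w - 1) - 1.0                      # normalize to [-1, 1]
    gy = 2.0 * pix[:, 1] / (h - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).reshape(b, h, w, 2)
    return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)

def total_loss(depth_real, tgt_img, src_img, T_src_tgt, K, K_inv,
               depth_syn, gt_depth_syn, w_photo=1.0, w_sup=1.0):
    # (i) Self-supervised geometric constraint on real monocular video:
    # the source frame warped by predicted depth should match the target frame.
    recon = warp_to_target(src_img, depth_real, T_src_tgt, K, K_inv)
    photo = (recon - tgt_img).abs().mean()
    # (ii) Supervised constraint on synthetic frames with ground-truth depth.
    sup = (depth_syn - gt_depth_syn).abs().mean()
    return w_photo * photo + w_sup * sup
```

The key design point this sketch captures is that the real-video branch needs no depth labels: supervision comes entirely from view synthesis, while the synthetic branch contributes a direct regression signal from rendered ground truth.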