3D Object Detection with a Self-supervised Lidar Scene Flow Backbone
Emeç Erçelik, Ekim Yurtsever, Mingyu Liu, Zhijie Yang, Hanzhen Zhang, Pınar Topçam, Maximilian Listl, Yılmaz Kaan Çaylı, Alois Knoll
State-of-the-art 3D detection methods rely on supervised learning and large
labelled datasets. However, annotating lidar data is resource-intensive, and
relying only on supervised learning limits the applicability of trained
models. Against this backdrop, we propose using a self-supervised training
strategy to learn a general point cloud backbone model for downstream 3D vision
tasks. 3D scene flow can be estimated with self-supervised learning using cycle
consistency, which removes the need for labelled data. Moreover, perceiving
objects in traffic scenarios heavily relies on making sense of sparse data in
a spatio-temporal context. Our main contribution leverages learned flow and
motion representations by combining a self-supervised backbone with a 3D
detection head, focusing on the relation between the scene flow and detection
tasks. In this way, self-supervised scene flow training builds point motion
features into the backbone that help a 3D detection head distinguish objects
by their distinct motion patterns.
Experiments on the KITTI and nuScenes benchmarks show that the proposed
self-supervised pre-training significantly increases 3D detection performance.
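
As a rough illustration of the cycle-consistency signal described above, the
following is a minimal sketch, not the paper's implementation: flow_net,
pc_t, and pc_t1 are assumed names for a per-point flow predictor and two
consecutive lidar sweeps. Forward flow warps a sweep toward the next
timestamp, backward flow warps it back, and the round-trip error supervises
the backbone without any flow labels.

    import torch

    def cycle_consistency_loss(flow_net, pc_t, pc_t1):
        # Hypothetical interface: flow_net(source, target) returns an
        # (N, 3) per-point flow estimate; pc_t and pc_t1 are (N, 3) lidar
        # point clouds from consecutive timestamps.
        flow_fwd = flow_net(pc_t, pc_t1)      # forward motion estimate
        pc_warped = pc_t + flow_fwd           # advect points from t toward t+1
        flow_bwd = flow_net(pc_warped, pc_t)  # flow back toward time t
        pc_cycled = pc_warped + flow_bwd      # round-trip point positions
        # Penalise the round-trip error: each point should return to its
        # starting position, so the input cloud itself is the target.
        return torch.mean(torch.sum((pc_cycled - pc_t) ** 2, dim=-1))

Because the loss target is the input cloud itself, no annotations are needed;
a backbone pre-trained this way can then be fine-tuned with a 3D detection
head, as the abstract describes.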