Self-Supervised Pretraining of 3D Features on any Point-Cloud
Pretraining on large labeled datasets is a prerequisite for achieving good
performance in many computer vision tasks, such as 2D object recognition and
video classification. However, pretraining is not widely used for 3D recognition
tasks where state-of-the-art methods train models from scratch. A primary
reason is the lack of large annotated datasets because 3D data is both
difficult to acquire and time-consuming to label. We present a simple
self-supervised pretraining method that works with any 3D data: single- or
multi-view, indoor or outdoor, acquired by varied sensors, and requiring no 3D
registration. We pretrain standard point-cloud and voxel-based model
architectures, and show that joint pretraining further improves performance. We
evaluate our models on 9 benchmarks for object detection, semantic
segmentation, and object classification, where they achieve state-of-the-art
results and can outperform supervised pretraining. We set a new
state-of-the-art for object detection on ScanNet (69.0% mAP) and SUNRGBD (63.5%
mAP). Our pretrained models are label-efficient and improve performance for
classes with few examples.
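
As an illustration only (the abstract does not specify the training objective), below is a minimal PyTorch-style sketch of one plausible joint-pretraining step: a contrastive InfoNCE loss that aligns embeddings of the same unlabeled scene produced by a point-based encoder and a voxel-based encoder. The names augment, point_encoder, voxel_encoder, and voxelize are hypothetical placeholders, not the paper's API.

    import torch
    import torch.nn.functional as F

    def info_nce(z_a, z_b, temperature=0.07):
        """Contrastive loss: matching rows of z_a and z_b are positives;
        all other rows in the batch serve as negatives."""
        z_a = F.normalize(z_a, dim=1)
        z_b = F.normalize(z_b, dim=1)
        logits = z_a @ z_b.t() / temperature   # (batch, batch) cosine similarities
        targets = torch.arange(z_a.size(0), device=z_a.device)
        return F.cross_entropy(logits, targets)

    # One hypothetical joint-pretraining step: embed each unlabeled scene
    # both as a point cloud and as a voxel grid, then pull the two
    # embeddings of the same scene together while pushing apart those of
    # different scenes.
    #
    # points = augment(batch_of_point_clouds)   # hypothetical augmentation
    # z_pts  = point_encoder(points)            # e.g. a PointNet++-style backbone
    # z_vox  = voxel_encoder(voxelize(points))  # e.g. a sparse-conv backbone
    # loss   = info_nce(z_pts, z_vox) + info_nce(z_vox, z_pts)
    # loss.backward()

Because the loss only needs two embeddings of the same scene, a step like this requires no labels and no 3D registration, which is consistent with the claim that the method applies to any point cloud.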