Aligned 3D Representations by Pre-training Vision, Language, and 3D Models
ULIP: Learning Unified Representation of Language, Image and Point Cloud for 3D Understanding
The understanding capabilities of current state-of-the-art 3D models are limited by datasets with a small amount of annotated data and a pre-defined set of categories.
Motivated by this, we introduce ULIP to learn a unified representation of image, text, and 3D point cloud by pre-training with object triplets from the three modalities.
ULIP then learns a 3D representation space aligned with the common image-text space, using a small number of automatically synthesized triplets.
ULIP is agnostic to the 3D backbone network and can easily be integrated into any 3D architecture.
Experiments show that ULIP effectively improves the performance of multiple recent 3D backbones by simply pre-training them on ShapeNet55 using our framework.
Authors
Le Xue, Mingfei Gao, Chen Xing, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, Silvio Savarese
However, compared to its 2D counterpart, 3D data is still limited by datasets with a small number of samples and a small set of pre-determined categories.
This scale limit of 3D data, caused by the high cost of 3D data collection and annotation, has been hindering the generalization of 3D recognition models and their real-world applications.
To tackle the shortage of annotated data, existing work in other domains shows that employing knowledge from different modalities can significantly help concept understanding in the original modality.
However, multimodal learning that involves the 3D modality, and whether it can help 3D recognition tasks, is still not well studied.
In this paper, we propose learning a Unified representation of Language, Image, and Point cloud (ULIP) for 3D understanding.
Our framework takes advantage of a vision-language model pre-trained on massive image-text pairs and aligns the feature space of a 3D point cloud encoder to the pre-aligned vision-language feature space.
It improves state-of-the-art visual concept recognition and enables zero-shot classification of unseen objects.
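To make this alignment concrete, the sketch below shows one way such triplet-based pre-training could be wired up: a trainable 3D point cloud encoder is pulled toward the frozen image and text encoders of a pre-trained vision-language model with CLIP-style symmetric contrastive losses. The module names (`point_encoder`, `clip_image_enc`, `clip_text_enc`) and the exact loss form are illustrative assumptions, not the released implementation.

```python
# Minimal sketch (not the authors' code) of triplet-based alignment:
# a trainable 3D encoder is aligned to the frozen, pre-aligned
# image/text feature space with CLIP-style contrastive losses.
# point_encoder, clip_image_enc, and clip_text_enc are placeholder modules.

import torch
import torch.nn.functional as F

def contrastive_loss(feat_a, feat_b, temperature=0.07):
    """Symmetric InfoNCE between two batches of L2-normalized features."""
    feat_a = F.normalize(feat_a, dim=-1)
    feat_b = F.normalize(feat_b, dim=-1)
    logits = feat_a @ feat_b.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(feat_a.size(0), device=feat_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def ulip_style_step(point_encoder, clip_image_enc, clip_text_enc,
                    points, images, texts, optimizer):
    """One pre-training step on a batch of (point cloud, image, text) triplets."""
    with torch.no_grad():                                 # frozen vision-language model
        img_feat = clip_image_enc(images)
        txt_feat = clip_text_enc(texts)

    pc_feat = point_encoder(points)                       # trainable 3D backbone
    loss = contrastive_loss(pc_feat, img_feat) + contrastive_loss(pc_feat, txt_feat)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Keeping the image and text encoders frozen fixes the common image-text space, so only the 3D encoder moves toward it during pre-training.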
Result
We propose a pre-training framework that aligns the image, text, and point cloud modalities in the same feature space.
Our method achieves state-of-the-art performance in both zero-shot and standard 3D classification tasks, and our qualitative results show that our method has promising potential for cross-modal retrieval applications.
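Once the spaces are aligned, zero-shot 3D classification reduces to nearest-neighbor matching between a point cloud embedding and text embeddings of candidate category names. The sketch below illustrates this under assumed placeholder names (`point_encoder`, `clip_text_enc`, `tokenize`) and an assumed prompt template; it is not the released evaluation code.

```python
# Minimal sketch (assumed, not the released implementation) of zero-shot 3D
# classification with an aligned 3D/text space: class names are embedded with
# the frozen text encoder and each point cloud is assigned to the nearest one.
# point_encoder, clip_text_enc, and tokenize are placeholders; the prompt
# template is an illustrative assumption.

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(points, class_names, point_encoder, clip_text_enc, tokenize):
    prompts = [f"a point cloud of a {name}" for name in class_names]  # assumed template
    txt_feat = F.normalize(clip_text_enc(tokenize(prompts)), dim=-1)  # (C, D)
    pc_feat = F.normalize(point_encoder(points), dim=-1)              # (B, D)
    similarity = pc_feat @ txt_feat.t()                               # (B, C)
    return similarity.argmax(dim=-1)                                  # predicted class per cloud
```

The same similarity matrix can be read row-wise or column-wise for cross-modal retrieval between point clouds and text (or, with the image encoder, between point clouds and images).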