We investigate whether a large multimodal model trained purely via masked token prediction, without modality-specific encoders or contrastive learning, can learn transferable representations for downstream tasks.
We propose a simple and scalable network architecture, the multimodal masked autoencoder (M3AE), which learns a unified encoder for both vision and language data via masked token prediction.
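For concreteness, the following is a minimal sketch of this idea in Python/PyTorch, assuming a ViT-style patch embedding and omitting positional and modality-type embeddings; the class and argument names are illustrative assumptions, not the paper's reference implementation.

```python
# Sketch of multimodal masked autoencoding: image patches and text tokens are
# embedded, concatenated into one sequence, a random subset is masked out, and a
# single shared transformer encodes only the visible tokens; a light decoder
# (stand-in here) reconstructs the masked tokens. Names are illustrative.
import torch
import torch.nn as nn


class M3AESketch(nn.Module):
    def __init__(self, vocab_size=30522, patch_dim=768, d_model=768,
                 n_heads=12, n_layers=12):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)      # image patches -> tokens
        self.text_embed = nn.Embedding(vocab_size, d_model)   # text ids -> tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # shared, modality-agnostic encoder
        self.decoder = nn.Linear(d_model, d_model)              # placeholder for the light decoder

    def forward(self, patches, text_ids, keep_idx):
        # Embed both modalities and concatenate into a single token sequence.
        tokens = torch.cat([self.patch_embed(patches), self.text_embed(text_ids)], dim=1)
        # Keep only the visible (unmasked) tokens, as in a masked autoencoder.
        idx = keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        visible = torch.gather(tokens, 1, idx)
        latent = self.encoder(visible)
        # Reconstruction targets would be the masked image patches and text tokens.
        return self.decoder(latent)
```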
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that it learns generalizable representations that transfer well to downstream tasks.
Surprisingly, we find that M3AE benefits from a higher text mask ratio (50-90%) due to the joint training of the two data modalities.
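The snippet below illustrates how such per-modality mask ratios could be applied to the concatenated sequence; the 75% image mask ratio follows the common MAE default and, like the function name, is an assumption for illustration only.

```python
# Illustrative only: sample independent masks, allowing a higher mask ratio for
# text tokens (50-90% as noted above) than for image patches.
import torch


def sample_keep_indices(n_image, n_text, image_ratio=0.75, text_ratio=0.75):
    # Return indices of tokens to KEEP (visible) in the [image | text] sequence.
    keep_img = torch.randperm(n_image)[: int(n_image * (1 - image_ratio))]
    keep_txt = torch.randperm(n_text)[: int(n_text * (1 - text_ratio))] + n_image
    return torch.sort(torch.cat([keep_img, keep_txt])).values


keep_idx = sample_keep_indices(n_image=196, n_text=64, text_ratio=0.75)
```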
We also provide qualitative analysis showing that the learned representation incorporates meaningful information from both image and language.