Multimodal Masked Autoencoder: Learning Representations from Both Vision and Language Data via Masked Token Prediction - 42Papers