Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
They mask random patches of the input image and reconstruct the missing pixels.
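The random masking step can be sketched in a few lines. The following is a minimal PyTorch-flavored illustration, not the authors' released code; the function name `random_masking`, the 75% default ratio, and the (batch, num_patches, dim) tensor layout are assumptions made here for concreteness:

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patches per sample; mask out the rest.

    patches: (batch, num_patches, dim) tensor of flattened image patches.
    Returns the visible patches, a binary mask (1 = masked), and the
    indices needed to restore the original patch order.
    """
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))

    # Per-sample random permutation of patch indices via argsort of noise
    noise = torch.rand(B, N, device=patches.device)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    # The first `num_keep` shuffled patches form the visible subset
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # Binary mask in the original patch order: 0 = visible, 1 = masked
    mask = torch.ones(B, N, device=patches.device)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore
```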
We develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens) and a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
We find that masking a high proportion of the input image (e.g., 75%) yields a nontrivial and meaningful self-supervisory task.
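Putting these two designs together, a hypothetical end-to-end sketch might look as follows. It reuses `random_masking` from above; `MiniMAE`, the layer widths, and the use of plain `nn.TransformerEncoder` blocks in place of ViT blocks are illustrative assumptions, and positional embeddings are omitted for brevity:

```python
import torch
import torch.nn as nn

class MiniMAE(nn.Module):
    """Toy asymmetric autoencoder: a deeper encoder over visible patches
    only, plus a shallow decoder over the full patch sequence."""

    def __init__(self, patch_pixels=768, enc_dim=256, dec_dim=128,
                 mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_pixels, enc_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=4, batch_first=True),
            num_layers=4)
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True),
            num_layers=1)  # lightweight relative to the encoder
        self.to_pixels = nn.Linear(dec_dim, patch_pixels)

    def forward(self, patches):
        # Encoder runs on the visible ~25% of patches only (no mask tokens)
        visible, mask, ids_restore = random_masking(patches, self.mask_ratio)
        latent = self.encoder(self.embed(visible))

        # Decoder input: encoded visible patches plus a shared, learned
        # mask token per missing patch, unshuffled to the original order
        x = self.enc_to_dec(latent)
        B, N = ids_restore.shape
        fill = self.mask_token.expand(B, N - x.shape[1], -1)
        x = torch.cat([x, fill], dim=1)
        x = torch.gather(
            x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
        pred = self.to_pixels(self.decoder(x))

        # Mean-squared pixel reconstruction loss on masked patches only
        loss = ((pred - patches) ** 2).mean(dim=-1)
        return (loss * mask).sum() / mask.sum()
```

Because the encoder never sees mask tokens, at a 75% masking ratio it processes only about a quarter of the patches, which is what makes the asymmetric design efficient to train and scale.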
Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
Authors
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick