EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
We launch EVA, a vision-centric foundation model to explore the limits of
visual representation at scale using only publicly accessible data. EVA is a
vanilla ViT pre-trained to reconstruct the masked-out, image-text aligned vision
features conditioned on visible image patches. Via this pretext task, we can
efficiently scale up EVA to one billion parameters and set new records on a
broad range of representative vision downstream tasks, such as image
recognition, video action recognition, object detection, instance segmentation
and semantic segmentation, without heavy supervised training. Moreover, we
observe that quantitative changes in scaling EVA result in qualitative changes in
transfer learning performance that are not present in other models. For
instance, EVA takes a great leap in the challenging large vocabulary instance
segmentation task: our model achieves almost the same state-of-the-art
performance on the LVISv1.0 dataset, which has over a thousand categories, as on
the COCO dataset, which has only eighty categories. Beyond a pure vision encoder, EVA can also
serve as a vision-centric, multi-modal pivot to connect images and text. We
find that initializing the vision tower of a giant CLIP with EVA can greatly
stabilize training and outperform the from-scratch counterpart
with far fewer samples and less compute, providing a new direction for scaling
up and accelerating the costly training of multi-modal foundation models. To
facilitate future research, we will release all the code and models at
\url{this https URL}.
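To make the pretext task concrete, below is a minimal PyTorch-style sketch of masked feature prediction as summarized in the abstract: a plain ViT predicts the image-text aligned (CLIP-style) features of masked-out patches given only the visible ones. The names `student_vit` and `clip_vision_encoder`, their interfaces, and the cosine-regression loss are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def mim_feature_loss(student_vit, clip_vision_encoder, images, mask):
    """Masked feature-prediction loss (sketch).

    images: (B, 3, H, W) batch of images.
    mask:   (B, N) boolean, True where a patch token is masked out.
    """
    with torch.no_grad():
        # Image-text aligned regression targets from a frozen CLIP vision
        # encoder, one feature vector per patch token (assumed interface).
        targets = F.normalize(clip_vision_encoder(images), dim=-1)  # (B, N, D)

    # The student sees the corrupted input; masked patches are assumed to be
    # replaced by a learnable [MASK] token inside `student_vit`.
    preds = F.normalize(student_vit(images, mask), dim=-1)          # (B, N, D)

    # Regress only the masked positions; negative cosine similarity is used
    # here as one plausible choice of regression loss.
    return -(preds * targets).sum(dim=-1)[mask].mean()
```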
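The abstract also describes reusing EVA as the vision tower of a giant CLIP instead of training that tower from scratch. A minimal sketch of that initialization step follows; `clip_model`, its `visual` attribute, and the checkpoint path are hypothetical placeholders assuming the CLIP image encoder shares EVA's ViT architecture.

```python
import torch

def init_clip_vision_tower_from_eva(clip_model, eva_checkpoint_path):
    """Load pre-trained EVA weights into a CLIP-style image encoder (sketch)."""
    eva_state = torch.load(eva_checkpoint_path, map_location="cpu")
    # strict=False: pre-training-only heads in the checkpoint are ignored, and
    # CLIP-specific projection layers keep their fresh initialization.
    missing, unexpected = clip_model.visual.load_state_dict(eva_state, strict=False)
    print(f"missing keys: {missing}\nunexpected keys: {unexpected}")
    return clip_model
```

Contrastive image-text training then proceeds as usual, with only the vision tower warm-started.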
Authors
Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao