PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers
This paper explores a better codebook for BERT pre-training of vision
transformers. The recent work BEiT successfully transfers BERT pre-training
from NLP to the vision field. It directly adopts one simple discrete VAE as the
visual tokenizer, but has not considered the semantic level of the resulting
visual tokens. By contrast, the discrete tokens in NLP field are naturally
highly semantic. This difference motivates us to learn a perceptual codebook.
We find that one simple yet effective idea works surprisingly well: enforcing perceptual
similarity during dVAE training. We demonstrate that the visual tokens
generated by the proposed perceptual codebook do exhibit better semantic
meanings, and subsequently help pre-training achieve superior transfer
performance in various downstream tasks. For example, we achieve 84.5% Top-1
accuracy on ImageNet-1K with a ViT-B backbone, outperforming the competitive
method BEiT by +1.3% with the same number of pre-training epochs. The proposed
codebook also improves object detection and segmentation on COCO val by +1.3 box
AP and +1.0 mask AP, and semantic segmentation on ADE20K by +1.0 mIoU. The code and
models will be available at \url{this https URL}.
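To make the core idea concrete, the sketch below illustrates how a perceptual term can be added to the usual pixel-level dVAE reconstruction objective. It is a minimal illustration under assumptions, not the authors' implementation: `feature_extractor` is a hypothetical placeholder for any frozen pretrained network that returns a list of intermediate feature maps, and `lambda_perceptual` is an assumed weighting coefficient.

```python
# Minimal sketch (not the authors' code): dVAE reconstruction loss with an
# added perceptual (feature-space) similarity term.
import torch
import torch.nn.functional as F

def dvae_loss(x, x_recon, feature_extractor, lambda_perceptual=1.0):
    """Pixel reconstruction loss plus a perceptual loss on deep features.

    x, x_recon        -- original and reconstructed image batches
    feature_extractor -- assumed frozen pretrained network returning a list
                         of intermediate feature maps (hypothetical helper)
    """
    # Standard pixel-level reconstruction term used by a plain dVAE.
    pixel_loss = F.mse_loss(x_recon, x)

    # Features of the real images carry no gradient; features of the
    # reconstruction do, so the perceptual term shapes the dVAE decoder.
    with torch.no_grad():
        feats_real = feature_extractor(x)
    feats_fake = feature_extractor(x_recon)

    # Match reconstructions to originals in feature space at each scale.
    perceptual_loss = sum(
        F.mse_loss(f_fake, f_real)
        for f_fake, f_real in zip(feats_fake, feats_real)
    )
    return pixel_loss + lambda_perceptual * perceptual_loss
```

The intent of the design is that tokens learned under a feature-space objective capture higher-level visual content than tokens optimized for pixel fidelity alone.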