Exploring Plain Vision Transformer Backbones for Object Detection
We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone
network for object detection. This design enables the original ViT architecture
to be fine-tuned for object detection without needing to redesign a
hierarchical backbone for pre-training. With minimal adaptations for
fine-tuning, our plain-backbone detector can achieve competitive results.
Surprisingly, we observe: (i) it is sufficient to build a simple feature
pyramid from a single-scale feature map (without the common FPN design) and
(ii) it is sufficient to use window attention (without shifting) aided by
very few cross-window propagation blocks (both observations are sketched in
code below). With plain ViT backbones pre-trained
as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the
previous leading methods that were all based on hierarchical backbones,
reaching up to 61.3 box AP on the COCO dataset using only ImageNet-1K
pre-training. We hope our study will draw attention to research on
plain-backbone detectors. Code will be made available.
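
As a rough illustration of observation (i), the following PyTorch sketch builds a multi-scale pyramid (strides 4, 8, 16, 32) purely from the backbone's single stride-16 output map, using deconvolutions and pooling. The class name SimpleFeaturePyramid, the channel widths, and the exact layer choices are illustrative assumptions, not the released implementation.

```python
# Minimal sketch (assumed details, not the authors' code): a simple feature
# pyramid built from one stride-16 ViT feature map, with no top-down FPN.
import torch
import torch.nn as nn

class SimpleFeaturePyramid(nn.Module):
    def __init__(self, dim=768, out_dim=256):
        super().__init__()
        # stride 4: upsample x4 with two stride-2 deconvolutions
        self.p2 = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, 2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim // 2, dim // 4, 2, stride=2),
            nn.Conv2d(dim // 4, out_dim, 1),
        )
        # stride 8: upsample x2 with one deconvolution
        self.p3 = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, 2, stride=2),
            nn.Conv2d(dim // 2, out_dim, 1),
        )
        # stride 16: keep the backbone's native resolution
        self.p4 = nn.Conv2d(dim, out_dim, 1)
        # stride 32: downsample x2 by pooling
        self.p5 = nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(dim, out_dim, 1),
        )

    def forward(self, x):  # x: (B, dim, H/16, W/16), the final ViT map
        return [self.p2(x), self.p3(x), self.p4(x), self.p5(x)]

feat = torch.randn(1, 768, 64, 64)       # single-scale backbone output
pyramid = SimpleFeaturePyramid()(feat)   # feature maps at strides 4..32
print([p.shape for p in pyramid])
```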
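
Observation (ii) can be sketched in the same spirit: most blocks attend within fixed, non-shifted windows, and only a few evenly spaced blocks attend globally to propagate information across windows. The window size, the placement of the global blocks, and the attention-only block bodies (no MLP or normalization) are simplifying assumptions here.

```python
# Minimal sketch (assumed details): non-shifted window attention in most
# blocks, with a handful of global-attention blocks for cross-window
# propagation. Blocks are reduced to attention + residual for brevity.
import torch
import torch.nn as nn

def window_attention(x, attn, window=14):
    # x: (B, H, W, C); partition into non-overlapping windows, attend, merge
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    y, _ = attn(x, x, x)                  # attention within each window
    x = x + y                             # residual connection
    x = x.view(B, H // window, W // window, window, window, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

class PlainViTBackbone(nn.Module):
    def __init__(self, depth=12, dim=768, heads=12, global_every=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(depth))
        # e.g. blocks 2, 5, 8, 11 attend globally (assumed placement)
        self.global_ids = {
            i for i in range(depth) if i % global_every == global_every - 1}

    def forward(self, x):  # x: (B, H, W, C) patch features
        B, H, W, C = x.shape
        for i, attn in enumerate(self.blocks):
            if i in self.global_ids:      # cross-window propagation block
                t = x.reshape(B, H * W, C)
                y, _ = attn(t, t, t)
                x = (t + y).view(B, H, W, C)
            else:                         # non-shifted window attention
                x = window_attention(x, attn)
        return x

out = PlainViTBackbone()(torch.randn(1, 56, 56, 768))
print(out.shape)
```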