Do We Really Need Explicit Position Encodings for Vision Transformers?
Almost all visual transformers such as ViT or DeiT rely on predefined
positional encodings to incorporate the order of each input token. These
encodings are often implemented as learnable fixed-dimension vectors or
sinusoidal functions of different frequencies, neither of which can
accommodate variable-length input sequences. This inevitably limits the wider
application of transformers in vision, where many tasks require changing the
input size on the fly.
In this paper, we propose a conditional position encoding scheme that is
conditioned on the local neighborhood of each input token. It is effortlessly
implemented as what we call a Position Encoding Generator (PEG), which can be
seamlessly incorporated into the current transformer framework. Our new model
with PEG, named Conditional Position encoding Visual Transformer (CPVT), can
naturally process input sequences of arbitrary length. We demonstrate that
CPVT yields attention maps visually similar to, and performance even better
than, those of models with predefined positional encodings. We obtain
state-of-the-art results on the ImageNet classification task compared with
visual transformers to date. Our code will be made available at
this https URL.
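
For concreteness, the sketch below shows one way a Position Encoding Generator could be realized in PyTorch, assuming the conditioning on local neighborhoods is implemented as a depthwise convolution applied to the patch tokens reshaped into their 2-D layout; the module name, kernel size, and residual addition here are illustrative assumptions rather than a specification taken from the abstract.

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Hypothetical Position Encoding Generator sketch (not the authors' exact code)."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise conv with zero padding: the local kernel conditions the
        # encoding on neighboring tokens, and the padding at the borders gives
        # each token an implicit cue about its absolute position.
        self.proj = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, N, C) patch tokens (class token excluded), with N = h * w.
        b, n, c = tokens.shape
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)   # back to a 2-D map
        pos = self.proj(feat).flatten(2).transpose(1, 2)    # (B, N, C) encodings
        return tokens + pos                                  # added like a residual

# Because the convolution operates on whatever h x w grid it is given, the same
# module handles inputs of arbitrary resolution without interpolating encodings.
peg = PEG(dim=192)
y_small = peg(torch.randn(2, 14 * 14, 192), h=14, w=14)   # e.g. 224x224 input, 16x16 patches
y_large = peg(torch.randn(2, 24 * 24, 192), h=24, w=24)   # larger input, no changes needed
```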
Authors
Xiangxiang Chu, Bo Zhang, Zhi Tian, Xiaolin Wei, Huaxia Xia