The Transformer, an attention-based encoder-decoder architecture, has
revolutionized the field of natural language processing. Inspired by this
significant achievement, pioneering works have recently adapted
Transformer-like architectures to computer vision (CV) and demonstrated their
effectiveness on various CV tasks. Owing to their competitive modeling
capability, visual Transformers have achieved impressive performance on
multiple benchmarks such as ImageNet, COCO, and ADE20K compared with modern
Convolutional Neural Networks (CNNs). In this paper, we provide a
comprehensive review of over one hundred visual
Transformers for three fundamental CV tasks (classification, detection, and
segmentation), where a taxonomy is proposed to organize these methods according
to their motivations, structures, and usage scenarios. Because these methods
differ in training settings and target tasks, we also evaluate them under
different configurations, rather than only on their original benchmarks, for
easy and intuitive comparison. Furthermore, we reveal a series of essential
but unexploited aspects that may empower Transformers to stand out from
numerous architectures, e.g., slack high-level semantic embeddings to bridge
the gap between visual and sequential Transformers. Finally, three promising
future research directions are suggested for further investigation.
Authors: Yang Liu, Yao Zhang, Yixin Wang, Feng Hou, Jin Yuan, Jiang Tian, Yang Zhang, Zhongchao Shi, Jianping Fan, Zhiqiang He