Efficient Vision Transformers for MobileNet Deployment
Rethinking Vision Transformers for MobileNet Size and Speed
We investigate the design choices of vision transformers and propose an improved supernet with low latency and high parameter efficiency.
The proposed models achieve higher top-1 accuracy than MobileNet baselines on ImageNet-1K with similar latency and parameter counts.
We further introduce a fine-grained joint search strategy that can find efficient architectures by optimizing latency and number of parameters simultaneously.
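To give a loose illustration of what a joint objective over latency and parameter count might look like, the Python sketch below scores candidate subnetworks sampled from a supernet. The `Candidate` fields, the normalization by a reference model, and the scoring formula are assumptions made for illustration, not the authors' exact search formulation.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """A candidate subnetwork sampled from the supernet (illustrative)."""
    name: str
    accuracy: float      # validation top-1 accuracy, in percent
    latency_ms: float    # latency measured on the target device
    params_m: float      # number of parameters, in millions

def joint_score(c: Candidate, base_latency_ms: float, base_params_m: float) -> float:
    """Score a candidate by accuracy per unit of normalized latency and size.

    Higher is better. Normalizing by a reference model keeps latency and
    parameter count on comparable scales. This weighting is an assumption
    for illustration, not the paper's exact objective.
    """
    latency_cost = c.latency_ms / base_latency_ms
    size_cost = c.params_m / base_params_m
    return c.accuracy / (latency_cost * size_cost)

# Example: pick the best of a few made-up candidates.
candidates = [
    Candidate("subnet_a", accuracy=75.1, latency_ms=1.2, params_m=3.5),
    Candidate("subnet_b", accuracy=76.0, latency_ms=1.6, params_m=3.9),
    Candidate("subnet_c", accuracy=74.8, latency_ms=1.0, params_m=3.1),
]
best = max(candidates, key=lambda c: joint_score(c, base_latency_ms=1.5, base_params_m=3.5))
print(best.name)
```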
Authors
Yanyu Li, Ju Hu, Yang Wen, Georgios Evangelidis, Kamyar Salahi, Yanzhi Wang, Sergey Tulyakov, Jian Ren
Vision Transformers (ViTs) have inspired many follow-up works to further refine the model architecture and improve training strategies, leading to superior results on most computer vision benchmarks, such as classification, segmentation, detection, and image synthesis.
However, despite their satisfactory performance, the latency and model size of ViTs remain less competitive than those of lightweight convolutional neural networks (CNNs), especially on resource-constrained mobile devices, which limits their wide deployment in real-world applications.
Many research efforts have been devoted to alleviating this limitation.
Among them, one direction reduces the quadratic computational complexity of the attention mechanism by constraining the receptive field to a pre-defined window size, which has also inspired subsequent work to refine attention patterns.
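For intuition on how a pre-defined window limits attention cost, here is a minimal PyTorch sketch that applies self-attention within non-overlapping windows so that the quadratic cost grows with the window size rather than the full sequence length. The shapes, the fixed `window` size, and the use of `nn.MultiheadAttention` are illustrative assumptions rather than any specific model's implementation.

```python
import torch
import torch.nn as nn

def window_attention(x: torch.Tensor, attn: nn.MultiheadAttention, window: int) -> torch.Tensor:
    """Apply self-attention within non-overlapping windows of length `window`.

    x: (batch, seq_len, dim) with seq_len divisible by `window`.
    Cost is quadratic in `window`, not in seq_len.
    """
    b, n, d = x.shape
    # Fold windows into the batch dimension so attention never crosses windows.
    xw = x.reshape(b * (n // window), window, d)
    out, _ = attn(xw, xw, xw, need_weights=False)
    return out.reshape(b, n, d)

# Example usage with assumed sizes.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
tokens = torch.randn(2, 196, 64)   # a 14x14 token grid with dimension 64
out = window_attention(tokens, attn, window=49)
print(out.shape)  # torch.Size([2, 196, 64])
```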
Another track combines lightweight CNNs with the attention mechanism to form a hybrid architecture.
The benefit is two-fold.
First, convolutions are shift-invariant and good at capturing local and detailed information, which makes them a strong complement to ViTs.
Second, by placing convolutions in the early stages and multi-head self-attention (MHSA) in the last few stages to model global dependencies, we naturally avoid performing MHSA at high resolution and save computation.
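To make this stage layout concrete, the following minimal PyTorch sketch places convolutional downsampling blocks at high resolution and applies MHSA only on the resulting low-resolution token grid. The channel widths, depths, and block designs are placeholders chosen for illustration, not the actual EfficientFormerV2 architecture.

```python
import torch
import torch.nn as nn

class HybridBackbone(nn.Module):
    """Conv stages at high resolution, MHSA only on the downsampled feature map."""

    def __init__(self, dim: int = 96, num_heads: int = 4):
        super().__init__()
        # Early stages: strided convolutions capture local detail cheaply.
        self.conv_stages = nn.Sequential(
            nn.Conv2d(3, dim // 2, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim // 2, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),  # 1/16 resolution
        )
        # Late stage: global attention on the small token grid.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv_stages(x)                      # (B, C, H/16, W/16)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (B, 196, C) for a 224x224 input
        tokens = self.norm(tokens)
        out, _ = self.attn(tokens, tokens, tokens, need_weights=False)
        return tokens + out                          # residual connection

model = HybridBackbone()
feats = model(torch.randn(1, 3, 224, 224))
print(feats.shape)  # torch.Size([1, 196, 96])
```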
Results
Hybrid vision backbones have emerged as a promising platform for efficient vision inference on mobile devices.
We further propose a fine-grained joint search over model size and speed, and obtain the EfficientFormerV2 model family, which is both lightweight and ultra-fast at inference.