We present Mobile-Former, a parallel design of MobileNet and Transformer with
a two-way bridge in between. This structure leverages the advantages of
MobileNet at local processing and Transformer at global interaction, while the
bridge enables bidirectional fusion of local and global features. Unlike
recent works on vision transformers, the Transformer in Mobile-Former
contains very few tokens (e.g., fewer than six) that are randomly
initialized, resulting in low computational cost. Combined with the proposed
lightweight cross attention that models the bridge, Mobile-Former is not only
computationally efficient but also has greater representational power,
outperforming MobileNetV3 in the low-FLOP regime from 25M to 500M FLOPs on ImageNet
classification. For instance, it achieves 77.9\% top-1 accuracy at 294M FLOPs,
gaining 1.3\% over MobileNetV3 while saving 17\% of the computation. When
transferred to object detection, Mobile-Former outperforms MobileNetV3 by 8.6
AP.
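
As a rough illustration of the parallel design described above, the following is a minimal PyTorch sketch of one Mobile-Former block. The module names, dimensions, block ordering, and the one-sided projection in the cross attention are our own assumptions for exposition, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class Mobile2Former(nn.Module):
    """Bridge, local -> global: tokens attend to the (unprojected) feature map.
    Projecting only the token side keeps the cross attention cheap."""
    def __init__(self, channels, token_dim):
        super().__init__()
        self.query = nn.Linear(token_dim, channels)
        self.out = nn.Linear(channels, token_dim)
        self.scale = channels ** -0.5

    def forward(self, tokens, feat):              # tokens: (B,M,D), feat: (B,N,C)
        q = self.query(tokens)                    # (B, M, C)
        attn = (q @ feat.transpose(1, 2)) * self.scale
        ctx = attn.softmax(dim=-1) @ feat         # (B, M, C)
        return tokens + self.out(ctx)             # residual update of tokens


class Former2Mobile(nn.Module):
    """Bridge, global -> local: feature-map positions (unprojected queries)
    attend to the projected tokens."""
    def __init__(self, channels, token_dim):
        super().__init__()
        self.key = nn.Linear(token_dim, channels)
        self.value = nn.Linear(token_dim, channels)
        self.scale = channels ** -0.5

    def forward(self, feat, tokens):              # feat: (B,N,C), tokens: (B,M,D)
        k = self.key(tokens)                      # (B, M, C)
        v = self.value(tokens)                    # (B, M, C)
        attn = (feat @ k.transpose(1, 2)) * self.scale
        return feat + attn.softmax(dim=-1) @ v    # residual update of features


class MobileFormerBlock(nn.Module):
    """One parallel block: Mobile (local) + Former (global) + two-way bridge.
    Dimensions and sub-block ordering are illustrative assumptions."""
    def __init__(self, channels=64, token_dim=192, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.mobile = nn.Sequential(              # inverted bottleneck (local)
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.former = nn.TransformerEncoderLayer( # self-attention over tokens (global)
            d_model=token_dim, nhead=4,
            dim_feedforward=2 * token_dim, batch_first=True)
        self.m2f = Mobile2Former(channels, token_dim)
        self.f2m = Former2Mobile(channels, token_dim)

    def forward(self, x, tokens):
        B, C, H, W = x.shape
        feat = x.flatten(2).transpose(1, 2)       # (B, H*W, C)
        tokens = self.m2f(tokens, feat)           # local -> global
        tokens = self.former(tokens)              # global interaction
        feat = self.f2m(feat, tokens)             # global -> local
        x = feat.transpose(1, 2).reshape(B, C, H, W)
        return x + self.mobile(x), tokens         # local processing + residual


# Usage: six randomly initialized global tokens, as in the abstract.
tokens = nn.Parameter(torch.randn(1, 6, 192))
block = MobileFormerBlock(channels=64, token_dim=192)
x = torch.randn(2, 64, 56, 56)
out, tok = block(x, tokens.expand(2, -1, -1))
print(out.shape, tok.shape)  # (2, 64, 56, 56), (2, 6, 192)
```

Because only the token side is linearly projected, the bridge's cost scales with the handful of global tokens rather than with the full feature map, which matches the abstract's claim of a lightweight cross attention.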
Authors: Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, Zicheng Liu