Convolutional Neural Networks (CNNs) are the go-to model for computer vision.
Recently, attention-based networks, such as the Vision Transformer, have also
become popular. In this paper we show that while convolutions and attention are
both sufficient for good performance, neither of them is necessary. We present
MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs).
MLP-Mixer contains two types of layers: one with MLPs applied independently to
image patches (i.e. "mixing" the per-location features), and one with MLPs
applied across patches (i.e. "mixing" spatial information). When trained on
large datasets, or with modern regularization schemes, MLP-Mixer attains
competitive scores on image classification benchmarks, with pre-training and
inference cost comparable to state-of-the-art models. We hope that these
results spark further research beyond the realms of well-established CNNs and
Transformers.
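
To make the two layer types concrete, the following is a minimal NumPy sketch of a single block, not the authors' implementation. The abstract specifies only that one MLP acts on each patch independently and another acts across patches. Everything else here is an assumption made for illustration: the function names, the two-layer MLP with a tanh-approximated GELU, the parameter-free layer normalization, the skip connections, the order of the two mixing steps, and all sizes.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (the choice of nonlinearity is an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def init_mlp(rng, dim, hidden):
    # Two-layer MLP mapping dim -> hidden -> dim (hypothetical sizes and init scale)
    return (rng.normal(0.0, 0.02, (dim, hidden)), np.zeros(hidden),
            rng.normal(0.0, 0.02, (hidden, dim)), np.zeros(dim))

def mlp(x, w1, b1, w2, b2):
    # Acts on the last axis of x; the same weights are shared by every row
    return gelu(x @ w1 + b1) @ w2 + b2

def layer_norm(x, eps=1e-6):
    # Normalizes over the channel axis; no learned scale or shift in this sketch
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mix_channels(x, params):
    # MLP applied independently to each patch: mixes per-location features
    return mlp(x, *params)

def mix_tokens(x, params):
    # MLP applied across patches: transpose so the patch axis comes last
    return mlp(x.T, *params).T

def mixer_block(x, token_params, channel_params):
    # x has shape (num_patches, channels); skip connections are an assumption
    x = x + mix_tokens(layer_norm(x), token_params)
    x = x + mix_channels(layer_norm(x), channel_params)
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))              # 4 patches, 8 channels (toy sizes)
token_params = init_mlp(rng, 4, 16)      # operates along the patch axis
channel_params = init_mlp(rng, 8, 32)    # operates along the channel axis
out = mixer_block(x, token_params, channel_params)
print(out.shape)  # (4, 8)
```

In this sketch the only difference between the two mixing steps is the transpose: the same kind of MLP is used in both, applied once along the channel axis and once along the patch axis.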
Authors
Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy