Sparsely-gated Expert Networks for Computer Vision
Scaling Vision with Sparse Mixture of Experts
We present a sparse version of the Vision Transformer that matches the performance of state-of-the-art networks while requiring as little as half of the compute at test-time.
We also propose an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute.
Finally, we demonstrate the potential of this sparse Vision Transformer to scale vision models, and train a 15B-parameter model that attains 90.35% top-1 accuracy on ImageNet.
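The sparsely-gated routing described above can be sketched in a few lines: a small gating network scores the experts for each token, and each token is processed only by its top-k experts, with the outputs combined by the (renormalized) gate weights. This is a minimal illustrative sketch in NumPy, not the paper's V-MoE implementation; all names, shapes, and the linear "experts" are assumptions for illustration.

```python
import numpy as np

# Minimal sketch of a sparsely-gated mixture-of-experts layer with top-k
# routing. Illustrative only; names/shapes are assumptions, not V-MoE code.

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts, k = 8, 16, 4, 2

# Each "expert" is a simple linear map here; the gate is another linear map.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(num_experts)]
gate_w = rng.standard_normal((d_model, num_experts)) * 0.1

x = rng.standard_normal((num_tokens, d_model))

# Gating: softmax over experts, then keep only the top-k experts per token.
logits = x @ gate_w
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
topk = np.argsort(probs, axis=-1)[:, -k:]  # indices of the k highest-scoring experts

out = np.zeros_like(x)
for t in range(num_tokens):
    w = probs[t, topk[t]]
    w = w / w.sum()  # renormalize gate weights over the selected experts
    for weight, e in zip(w, topk[t]):
        out[t] += weight * (x[t] @ experts[e])  # each token visits only k experts

print(out.shape)
```

Because every token activates only k of the experts, the parameter count grows with the number of experts while the per-token compute stays roughly constant, which is what lets the model scale to billions of parameters.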
Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, Neil Houlsby
Scale vision models
Sparsely-gated mixture of experts networks
15B-parameter model
Read the Paper