Swin Transformer V2: Scaling Up Capacity and Resolution
We present techniques for scaling Swin Transformer up to 3 billion parameters and making it capable of training with images of up to 1,536×1,536 resolution.
Our techniques are generally applicable for scaling up vision models, which has not been as widely explored as the scaling of NLP language models, partly due to the following difficulties in training and application: 1) vision models often face instability issues at scale, and 2) many downstream vision tasks require high-resolution images or windows, and it is not clear how to effectively transfer models pre-trained at low resolutions to higher-resolution ones.
Using these techniques and self-supervised pre-training, we successfully train a strong 3-billion-parameter Swin Transformer model and effectively transfer it to various vision tasks involving high-resolution images or windows, achieving state-of-the-art accuracy on a variety of vision benchmarks.
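To give a concrete sense of two of the techniques behind these results, the sketch below illustrates scaled cosine attention, which bounds attention logits to stabilize training at scale (difficulty 1), and the log-spaced relative coordinates underlying the continuous position bias, which keep coordinate extrapolation mild when transferring from low-resolution pre-training windows to high-resolution fine-tuning windows (difficulty 2). This is a minimal PyTorch sketch under our own naming, not the authors' reference implementation (available at https://github.com/microsoft/Swin-Transformer); the helper names, shapes, and the zero-valued toy bias are illustrative choices.

```python
import torch
import torch.nn.functional as F

def log_spaced_relative_coords(win_h, win_w):
    # Continuous relative coordinates between all pairs of positions in a
    # (win_h x win_w) window, remapped as sign(d) * log(1 + |d|). The log
    # spacing keeps the coordinate range compact when a model pre-trained
    # with small windows is fine-tuned with larger ones.
    ys, xs = torch.arange(win_h), torch.arange(win_w)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([yy.flatten(), xx.flatten()])       # (2, N)
    rel = (coords[:, :, None] - coords[:, None, :]).float()  # (2, N, N)
    return torch.sign(rel) * torch.log1p(rel.abs())

def scaled_cosine_attention(q, k, v, tau, bias):
    # Cosine similarity between queries and keys, divided by a learnable
    # temperature tau (kept >= 0.01), instead of a raw dot product; this
    # bounds the attention logits, which is the stability fix described
    # in the paper. `bias` stands in for the relative position bias
    # produced by the continuous-position-bias meta network.
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.transpose(-2, -1) / tau.clamp(min=0.01) + bias
    return logits.softmax(dim=-1) @ v

# Toy usage: one head, an 8x8 window (64 tokens), 32-dim head.
q, k, v = (torch.randn(1, 64, 32) for _ in range(3))
tau = torch.tensor(0.1)                              # learnable in practice
bias = torch.zeros(1, 64, 64)                        # placeholder position bias
out = scaled_cosine_attention(q, k, v, tau, bias)    # -> (1, 64, 32)
coords = log_spaced_relative_coords(8, 8)            # -> (2, 64, 64)
```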
Authors
Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo