We reexamine the design spaces and test the limits of what a pure convnet can achieve on general computer vision tasks such as object detection and semantic segmentation.
We gradually"modernize"a standard resnet toward the design of a vision transformer, and discover several key components that contribute to the performance difference along the way.
The outcome of this exploration is a family of pure convnet models dubbed convnext.constructed entirely from standard convnet modules, convnexts compete favorably with transformers in terms of accuracy and scalability, achieving 87.8% imagenet top-1 accuracy and outperforming hierarchical transformers on object detection and semantic segmentation.