Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing
Figure: A sample of objects in our proposed dataset.
Modern deep neural networks tend to be evaluated on static test sets, which makes it hard to assess their robustness to fine-grained naturalistic variations such as object pose, scale, viewpoint, lighting and 3D occlusions.
Counterfactual Simulation Testing addresses this by rendering realistic synthetic scenes and asking counterfactual questions of a trained network, such as "would your classification still be correct if the object were viewed from a different viewpoint?" or "would your classification still be correct if the object were partially occluded by another object?".
Our method allows for a fair comparison of recently released, state-of-the-art convolutional neural networks and vision transformers with respect to these naturalistic variations.
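To make the protocol concrete, here is a minimal sketch of counterfactual simulation testing, not the paper's released code: it assumes a hypothetical `render(obj, **scene)` scene simulator and a pretrained `classifier`, and measures how often a prediction that is correct in the default scene stays correct after a single scene factor is changed.

```python
# Minimal sketch of counterfactual simulation testing (illustrative only).
# `classifier(image) -> label` and `render(obj, **scene) -> image` are
# hypothetical placeholders for a trained network and a scene simulator.

def counterfactual_accuracy(objects, classifier, render, factor, values):
    """Fraction of counterfactual renders that keep the correct label,
    counted only for objects the network classifies correctly by default."""
    kept, total = 0, 0
    for obj in objects:
        if classifier(render(obj)) != obj.label:   # skip objects that are
            continue                               # already misclassified
        for value in values:                       # e.g. a sweep of poses
            prediction = classifier(render(obj, **{factor: value}))
            kept += int(prediction == obj.label)
            total += 1
    return kept / max(total, 1)
```

Comparing this score for two architectures on the same factor sweep (pose, scale, lighting or occlusion) gives the kind of head-to-head robustness comparison described above.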
We find evidence that ConvNeXt is more robust to pose and scale variations than Swin, that ConvNeXt generalizes better to our simulated domain, and that Swin handles partial occlusion better than ConvNeXt.
Authors
Nataniel Ruiz, Sarah Adel Bargal, Cihang Xie, Kate Saenko, Stan Sclaroff
Vision transformers (ViTs) have been proposed as an alternative deep neural network model to rival convolutional neural networks (ConvNets) for computer vision tasks.
Although the competition between these two classes of architectures has not yet been decided, there have been high-profile studies of potential advantages of ViTs compared to ConvNets.
A more general limitation of works that compare properties of ViTs and ConvNets is that, even though they try to compare models of similar sizes and ImageNet accuracies, they do not account for the fact that the compared models are trained with different recipes, so performance differences cannot be attributed to the architectures alone.
This work allows for a closer inspection of whether transformers are superior to ConvNets due to the difference in inductive biases between transformer and convolutional layers.
Result
We conduct a counterfactual comparative study of Swin Transformers and ConvNeXt networks by proposing a novel realistic synthetic dataset of naturalistic scene variations.
We find that (1) ConvNeXt networks are more robust to the simulated domain shift than Swin Transformers, (2) ConvNeXt networks are more robust to scale and pose variations than Swin Transformers, (3) Swin Transformers are more robust than ConvNeXt networks with respect to partial occlusion, and (4) robustness to all factors increases with network size (for both classes of networks) and with training dataset size.
To study how robustness scales, we evaluate 4 different sizes of each network trained on 2 different datasets (ImageNet-1K and ImageNet-22K).
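As a sketch of that comparison grid, reusing the `counterfactual_accuracy` helper above and assuming a hypothetical `load_model(arch, size, pretraining)` loader; the size and factor names below are illustrative and not taken from the paper:

```python
# Illustrative robustness grid over architecture, model size, pretraining set
# and scene factor. All names below are assumptions for the sketch.
ARCHS = ["convnext", "swin"]
SIZES = ["tiny", "small", "base", "large"]            # 4 assumed model sizes
DATASETS = ["imagenet-1k", "imagenet-22k"]            # 2 pretraining sets
FACTORS = {"pose": [], "scale": [], "occlusion": []}  # fill with variation values

def robustness_grid(objects, render, load_model):
    grid = {}
    for arch in ARCHS:
        for size in SIZES:
            for data in DATASETS:
                model = load_model(arch, size, pretraining=data)
                for factor, values in FACTORS.items():
                    grid[(arch, size, data, factor)] = counterfactual_accuracy(
                        objects, model, render, factor, values)
    return grid
```

Scanning such a grid along the size and dataset axes is one way to check whether robustness grows with model and data scale, as reported above.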
Our work can also be used to address model bias, e.g. by testing face analysis networks with a respective simulator.