Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation
We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure, i.e., they process an image as a sequence of image patches. We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics and makes the image unrecognizable by humans. This indicates that ViTs heavily rely on features that survive such transformations but are generally not indicative of the semantic class to humans. Further investigation shows that these features are useful but non-robust: ViTs trained on them can achieve high in-distribution accuracy, but break down under distribution shifts.
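As a concrete illustration of such a transformation, the sketch below shuffles the non-overlapping patches of an image. This is a minimal NumPy example; the exact patch-based operations and patch sizes used in the paper may differ.

```python
import numpy as np

def patch_shuffle(image, patch_size=16, rng=None):
    """Split an (H, W, C) image into non-overlapping patches and shuffle them.

    Minimal sketch of a semantics-destroying, patch-based transform;
    the paper's exact operations and patch sizes may differ.
    """
    rng = rng if rng is not None else np.random.default_rng()
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    # Cut the image into a sequence of (gh * gw) patches.
    patches = (image[:gh * patch_size, :gw * patch_size]
               .reshape(gh, patch_size, gw, patch_size, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(gh * gw, patch_size, patch_size, c))
    # Randomly permute the patch order: global semantics are destroyed,
    # but patch-local statistics survive.
    patches = patches[rng.permutation(gh * gw)]
    # Reassemble the shuffled patches back into an image.
    return (patches.reshape(gh, gw, patch_size, patch_size, c)
            .transpose(0, 2, 1, 3, 4)
            .reshape(gh * patch_size, gw * patch_size, c))
```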
From this understanding, we ask: can training the model to rely less on these features improve ViT robustness and out-of-distribution performance? We use images transformed with our patch-based operations as negatively augmented views and introduce losses that regularize training away from these non-robust features. This complements existing research, which mostly focuses on augmenting inputs with semantics-preserving transformations to enforce model invariance.
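To make the idea concrete, here is one plausible way such a regularizer could be wired into training, assuming a PyTorch-style setup. The uniform-target KL term and the weight `alpha` are illustrative assumptions, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def negative_augmentation_loss(model, x, y, negative_x, alpha=0.3):
    """Cross-entropy on clean images, plus a penalty on confident
    predictions for negatively augmented views.

    Hedged sketch only: the paper's actual losses may differ; the
    uniform-target KL term and `alpha` are illustrative choices.
    """
    # Usual supervised loss on the clean batch.
    clean_loss = F.cross_entropy(model(x), y)

    # On negative views (e.g., patch-shuffled images), push predictions
    # toward the uniform distribution: the model should not recover the
    # label from features that survive semantics-destroying transforms.
    neg_logits = model(negative_x)
    log_probs = F.log_softmax(neg_logits, dim=-1)
    num_classes = neg_logits.shape[-1]
    uniform = torch.full_like(log_probs, 1.0 / num_classes)
    neg_loss = F.kl_div(log_probs, uniform, reduction="batchmean")

    return clean_loss + alpha * neg_loss
```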
We show that patch-based negative augmentation consistently improves the robustness of ViTs across a wide set of ImageNet-based robustness benchmarks. Furthermore, we find that patch-based negative augmentation is complementary to traditional (positive) data augmentation, and that the two together further boost performance. All code from this work will be open-sourced.
Authors
Yao Qin, Chiyuan Zhang, Ting Chen, Balaji Lakshminarayanan, Alex Beutel, Xuezhi Wang