Large pre-trained models such as CLIP offer consistent accuracy across a
range of data distributions when performing zero-shot inference (i.e., without
fine-tuning on a specific dataset). Although existing fine-tuning approaches
substantially improve in-distribution accuracy, they often reduce
out-of-distribution robustness. We address this tension by introducing a simple
and effective method for improving robustness: ensembling the weights of the
zero-shot and fine-tuned models. Compared to standard fine-tuning, the
resulting weight-space ensembles provide large accuracy improvements
out-of-distribution, while matching or improving in-distribution accuracy. On
ImageNet and five derived distribution shifts, weight-space ensembles improve
out-of-distribution accuracy by 2 to 10 percentage points while increasing
in-distribution accuracy by nearly 1 percentage point relative to standard
fine-tuning. These improvements come at no additional computational cost during
fine-tuning or inference.
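Concretely, the weight-space ensemble is a per-parameter linear interpolation between the zero-shot and fine-tuned checkpoints. The sketch below is a minimal illustration of that idea, not the authors' released implementation; the function name `wise_ft_interpolate` and the mixing coefficient `alpha` are our own labels, and both state dicts are assumed to come from the same architecture.

```python
import torch

def wise_ft_interpolate(zeroshot_sd, finetuned_sd, alpha=0.5):
    """Weight-space ensemble: per-parameter linear interpolation between
    a zero-shot and a fine-tuned checkpoint of the same architecture.

    alpha = 0 recovers the zero-shot model; alpha = 1 recovers standard
    fine-tuning; intermediate values trade off between the two.
    """
    assert zeroshot_sd.keys() == finetuned_sd.keys(), "checkpoints must match"
    return {
        key: (1.0 - alpha) * zeroshot_sd[key] + alpha * finetuned_sd[key]
        for key in zeroshot_sd
    }

# Hypothetical usage, assuming `model`, `zs_sd`, and `ft_sd` share one
# architecture (e.g., a CLIP image encoder plus classification head):
# model.load_state_dict(wise_ft_interpolate(zs_sd, ft_sd, alpha=0.5))
```

Because the interpolation happens once in weight space, inference runs a single model of the original size, which is why the method adds no computational cost at fine-tuning or inference time.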
Authors
Mitchell Wortsman, Gabriel Ilharco, Mike Li, Jong Wook Kim, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, Ludwig Schmidt