Beyond neural scaling laws: beating power law scaling via data pruning
Widely observed neural scaling laws, in which error falls off as a power of
the training set size, model size, or both, have driven substantial performance
improvements in deep learning. However, these improvements through scaling
alone come at considerable cost in compute and energy. Here we focus on the
scaling of error with dataset size and show how, in both theory and practice, we
can break beyond power law scaling and reduce it to exponential scaling instead,
provided we have access to a high-quality data pruning metric that ranks the
order in which training examples should be discarded to achieve any pruned
dataset size.
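As a concrete illustration of this recipe (a minimal sketch, not the implementation used in the paper's experiments), pruning with such a metric amounts to scoring every training example, ranking by score, and keeping only a target fraction of the data. The scoring function, the keep fraction, and the choice of which end of the ranking to discard are placeholders introduced here for illustration.

```python
import numpy as np

def prune_dataset(examples, scores, keep_fraction, keep_hardest=True):
    """Rank training examples by a pruning-metric score and keep a fraction.

    examples:      sequence of training examples (indices, file paths, etc.)
    scores:        one scalar score per example from some pruning metric
    keep_fraction: fraction of the dataset to retain after pruning
    keep_hardest:  if True, keep the highest-scoring (e.g. hardest) examples
    """
    scores = np.asarray(scores)
    order = np.argsort(scores)        # indices sorted by ascending score
    if keep_hardest:
        order = order[::-1]           # descending: highest scores first
    n_keep = int(round(keep_fraction * len(scores)))
    kept_indices = np.sort(order[:n_keep])
    return [examples[i] for i in kept_indices]

# Toy usage: random scores stand in for a real pruning metric.
rng = np.random.default_rng(0)
examples = list(range(10_000))        # placeholders for training images
scores = rng.standard_normal(len(examples))
pruned = prune_dataset(examples, scores, keep_fraction=0.5)
print(f"kept {len(pruned)} of {len(examples)} examples")
```

A metric of this form yields a pruned dataset at any desired size simply by varying keep_fraction, which is the property the abstract requires of a pruning metric.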
We then test this prediction of exponential scaling with pruned dataset size
empirically, and indeed observe better-than-power-law scaling of error on
ResNets trained on CIFAR-10, SVHN, and ImageNet. Given the importance of
finding high-quality pruning metrics, we perform the first large-scale
benchmarking study of ten different data pruning metrics on ImageNet. We find
that most existing high-performing metrics scale poorly to ImageNet, while the
best are computationally intensive and require labels for every image. We
therefore develop a new, simple, cheap, and scalable self-supervised pruning
metric that performs comparably to the best supervised metrics. Overall,
our work suggests that the discovery of good data pruning metrics may provide a
viable path forward to substantially improved neural scaling laws, thereby
reducing the resource costs of modern deep learning.
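The abstract does not spell out how the self-supervised pruning metric is constructed, so the following is only an illustrative sketch under an assumed design: embed images with a self-supervised encoder, cluster the embeddings with k-means, and score each image by its distance to the nearest cluster centroid, so that no labels are needed. The embedding source, the number of clusters, and the use of scikit-learn's KMeans are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

def self_supervised_prune_scores(embeddings, n_clusters=100, seed=0):
    """Score each example by its distance to the nearest k-means centroid
    fit on (self-supervised) embeddings; no labels are required.

    Higher scores mean the example lies farther from every cluster prototype.
    """
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    kmeans.fit(embeddings)
    # transform() returns the distance from each embedding to every centroid.
    distances = kmeans.transform(embeddings)
    return distances.min(axis=1)

# Toy usage: random vectors stand in for self-supervised image embeddings.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((5_000, 128)).astype(np.float32)
scores = self_supervised_prune_scores(embeddings, n_clusters=50)
print(scores.shape, float(scores.min()), float(scores.max()))
```

Scores of this kind can be fed directly into the ranking-and-keeping step sketched earlier; the appeal of such a scheme is that it needs only a single clustering pass over precomputed embeddings rather than labels for every image.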
Authors
Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, Ari S. Morcos