The Deep Bootstrap: Good Online Learners are Good Offline Generalizers
We propose a new framework for reasoning about generalization in deep
learning. The core idea is to couple the Real World, where optimizers take
stochastic gradient steps on the empirical loss, to an Ideal World, where
optimizers take steps on the population loss. This leads to an alternate decomposition of test error into (1) the Ideal World test error plus (2) the gap between the two worlds. If the gap (2) is universally small, this reduces the problem of generalization in offline learning to the problem of optimization in online learning.
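To make the decomposition concrete, one can write $f_t$ for the Real World model after $t$ optimizer steps on the empirical loss and $f^{\mathrm{iid}}_t$ for the Ideal World model after the same number of steps on the population loss (notation introduced here for illustration; the abstract itself fixes none). The decomposition is then the identity obtained by adding and subtracting the Ideal World term:

$$
\underbrace{\mathrm{TestError}(f_t)}_{\text{Real World}}
\;=\;
\underbrace{\mathrm{TestError}\big(f^{\mathrm{iid}}_t\big)}_{\text{(1) Ideal World test error}}
\;+\;
\underbrace{\mathrm{TestError}(f_t) - \mathrm{TestError}\big(f^{\mathrm{iid}}_t\big)}_{\text{(2) gap between the worlds}}.
$$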
We then give empirical evidence that this gap between worlds can be small in realistic deep learning settings, in particular supervised image classification. For example, CNNs generalize better than MLPs
on image distributions in the Real World, but this is "because" they optimize
faster on the population loss in the Ideal World. This suggests that our framework is a useful tool for understanding generalization in deep learning, and lays a foundation for future research in the area.
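A minimal sketch of the two coupled worlds, assuming a toy setup (logistic regression on synthetic Gaussian data in plain NumPy) that is not the paper's experimental setting: both worlds share the same model, optimizer, step count, and initialization, and differ only in whether minibatches are reused from a fixed training set (Real World) or drawn fresh from the population (Ideal World, whose minibatch gradients are unbiased estimates of the population-loss gradient).

```python
# Toy illustration of the Real World / Ideal World coupling.
# Assumptions: synthetic Gaussian data, logistic loss, plain SGD.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, batch, steps, lr = 20, 500, 20_000, 32, 2_000, 0.1
w_true = rng.normal(size=d)

def sample(n):
    """Draw n labelled points from the (synthetic) population distribution."""
    X = rng.normal(size=(n, d))
    y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)
    return X, y

def sgd_step(w, X, y, lr):
    """One SGD step on the logistic loss over a single minibatch."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return w - lr * X.T @ (p - y) / len(y)

def test_error(w, X, y):
    """0-1 test error of the linear classifier sign(X @ w)."""
    return np.mean((X @ w > 0).astype(float) != y)

X_train, y_train = sample(n_train)   # fixed empirical sample (Real World data)
X_test, y_test = sample(n_test)      # large held-out proxy for the population

w_real = np.zeros(d)    # Real World iterate
w_ideal = np.zeros(d)   # Ideal World iterate (same initialization)
for t in range(steps):
    # Real World: minibatch resampled (with reuse) from the fixed training set.
    idx = rng.integers(0, n_train, size=batch)
    w_real = sgd_step(w_real, X_train[idx], y_train[idx], lr)
    # Ideal World: minibatch of fresh samples, so no data point is ever reused.
    Xb, yb = sample(batch)
    w_ideal = sgd_step(w_ideal, Xb, yb, lr)

err_real = test_error(w_real, X_test, y_test)
err_ideal = test_error(w_ideal, X_test, y_test)
print(f"Real World test error:   {err_real:.3f}")
print(f"Ideal World test error:  {err_ideal:.3f}")
print(f"Gap between the worlds:  {err_real - err_ideal:+.3f}")
```

In this convex toy problem the gap between worlds is small almost by construction; the empirical claim above is that the gap can also be small in realistic deep learning settings such as supervised image classification.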