A Brief Prehistory of Double Descent
Marco Loog, Tom Viering, Alexander Mey, Jesse H. Krijthe, David M.J. Tax
In their thought-provoking paper [1], Belkin et al. illustrate and discuss
the shape of risk curves in the context of modern high-complexity learners.
Given a fixed training sample size $n$, such curves show the risk of a learner
as a function of some (approximate) measure of its complexity $N$. With $N$ the
number of features, these curves are also referred to as feature curves. A
salient observation in [1] is that these curves can display what the authors
call double descent: with increasing $N$, the risk initially decreases, attains a
minimum, and then increases until $N$ equals $n$, where the training data is
fitted perfectly. Increasing $N$ even further, the risk decreases a second and
final time, creating a peak at $N=n$. This twofold descent may come as a
surprise but, contrary to what [1] reports, it has not been overlooked
historically. Our letter draws attention to some original, earlier findings that
are of interest to contemporary machine learning.
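
The shape of such a feature curve can be reproduced in a small simulation. The following is a minimal sketch under assumptions that are ours and not taken from [1] or from the earlier works discussed in this letter: standard normal features, a linear ground truth in which only the first few features carry signal, Gaussian label noise, and minimum-norm least squares as the learner. The function name `feature_curve` and all of its parameters are hypothetical and chosen for illustration only.

```python
import numpy as np


def feature_curve(n=20, max_features=80, n_informative=5, noise_std=1.0,
                  n_trials=100, seed=0):
    """Average risk of minimum-norm least squares on the first N features.

    Returns an array whose entry N is the estimated risk when N features
    are used, for N = 0, ..., max_features.
    """
    rng = np.random.default_rng(seed)
    # Assumed ground truth: only the first few features carry signal.
    beta = np.zeros(max_features)
    beta[:n_informative] = 1.0 / np.sqrt(n_informative)

    risks = np.zeros(max_features + 1)
    for _ in range(n_trials):
        X = rng.standard_normal((n, max_features))
        y = X @ beta + noise_std * rng.standard_normal(n)
        for N in range(max_features + 1):
            beta_hat = np.zeros(max_features)
            if N > 0:
                # The pseudoinverse gives ordinary least squares for N < n
                # and the minimum-norm interpolating solution for N >= n.
                beta_hat[:N] = np.linalg.pinv(X[:, :N]) @ y
            # With standard normal features, the expected squared error on
            # a fresh sample equals ||beta_hat - beta||^2 + noise_std^2.
            risks[N] += np.sum((beta_hat - beta) ** 2) + noise_std ** 2
    return risks / n_trials


if __name__ == "__main__":
    risks = feature_curve()
    for N in (1, 5, 10, 20, 30, 80):
        print(f"N = {N:2d}  estimated risk = {risks[N]:.2f}")
```

In this setting the estimated risk should first drop as the informative features are added, rise sharply around $N=n$, and decrease again beyond that point. The exact height of the peak varies strongly with the random seed, because the risk at $N=n$ is heavy-tailed, and other data-generating assumptions can give a differently shaped curve.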