Every Model Learned by Gradient Descent Is Approximately a Kernel Machine

We show that deep networks learned by the standard gradient descent algorithm are in fact mathematically approximately equivalent to kernel machines, a learning method that simply memorizes the data and uses it directly for prediction via a similarity function (the kernel).
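To make the kernel-machine side of this equivalence concrete, here is a minimal sketch of how such a model predicts: it keeps the training examples and outputs a similarity-weighted combination of them, y(x) = Σᵢ aᵢ K(x, xᵢ) + b. This is an illustrative example only; the kernel K, the RBF choice, and all names here are assumptions for demonstration (the paper's result actually involves a kernel induced by the gradient-descent path, not the RBF kernel used below).

```python
import numpy as np

def rbf_kernel(x, xi, gamma=1.0):
    """Gaussian similarity between a query x and a stored example xi."""
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def kernel_machine_predict(x, X_train, a, b=0.0, gamma=1.0):
    """y(x) = sum_i a_i * K(x, x_i) + b: prediction is a similarity-weighted
    vote over the memorized training examples."""
    return sum(ai * rbf_kernel(x, xi, gamma) for ai, xi in zip(a, X_train)) + b

# Tiny example: two memorized points with opposite coefficients.
X_train = np.array([[0.0, 0.0], [2.0, 2.0]])
a = np.array([1.0, -1.0])

# A query near the first stored point is dominated by that point's coefficient.
print(kernel_machine_predict(np.array([0.0, 0.0]), X_train, a))
```

The key contrast with the usual picture of deep learning is that nothing here learns a representation: all the work is done by the stored data and the similarity function.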

This greatly enhances the interpretability of deep network weights by showing that they are effectively a superposition of the training examples.
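The "superposition" view can be illustrated directly: under (stochastic) gradient descent, the final weights equal the initialization plus a sum of per-example gradient contributions, so each training example leaves an additive trace in the weights. The sketch below is an assumed illustration on a toy linear model, not the paper's derivation; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # 8 training examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])   # targets from a known linear function

w = np.zeros(3)                       # initial weights
w_init = w.copy()
lr = 0.05
contrib = np.zeros_like(X)            # per-example accumulated updates

for epoch in range(20):
    for i, (xi, yi) in enumerate(zip(X, y)):
        grad = (w @ xi - yi) * xi     # squared-loss gradient for example i
        w -= lr * grad
        contrib[i] -= lr * grad       # record this example's additive trace

# The learned weights decompose exactly into per-example contributions:
# w - w_init == sum_i contrib[i], i.e. a superposition of the examples' updates.
assert np.allclose(w - w_init, contrib.sum(axis=0))
```

Because every update is a sum over per-example gradients, the decomposition holds exactly; the paper's contribution is relating these accumulated gradient traces to a kernel over the training examples.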

The network architecture incorporates knowledge of the target function into the kernel.

This improved understanding should lead to better learning algorithms.