Every Model Learned by Gradient Descent Is Approximately a Kernel Machine
We show that deep networks learned by the standard gradient descent algorithm are in fact mathematically approximately equivalent to kernel machines, a learning method that simply memorizes the data and uses it directly for prediction via a similarity function (the kernel).
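For reference, a kernel machine makes predictions from a weighted combination of kernel similarities to the stored training examples; the form below, with coefficients $a_i$, offset $b$, and an optional output nonlinearity $g$, is the standard kernel-machine prediction rule rather than anything specific to this result.
\[
\hat{y}(x) \;=\; g\!\left( \sum_{i=1}^{m} a_i \, K(x, x_i) \;+\; b \right),
\]
where $x_1, \dots, x_m$ are the training examples, $K$ is the kernel (the similarity function), and the $a_i$ and $b$ are learned coefficients.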
This greatly enhances the interpretability of deep network weights, by elucidating that they are effectively a superposition of the training examples.
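As a sketch of why the weights are a superposition of training-example contributions: under full-batch gradient descent with learning rate $\epsilon$ and per-example loss $L$, the final weights differ from the initial weights $w_0$ by a sum of gradient terms, one per training example per step,
\[
w \;=\; w_0 \;-\; \epsilon \sum_{t} \sum_{i=1}^{m} \nabla_w L\big(y_i^{*}, f_{w_t}(x_i)\big),
\]
so each training example $(x_i, y_i^{*})$ contributes an additive term to the learned weights.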
The network architecture incorporates knowledge of the target function into the kernel.
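One way to see how the architecture enters the kernel: the tangent kernel from the neural tangent kernel literature, whose average along the gradient descent path gives the path kernel appearing in this result, is the dot product of the model's weight gradients,
\[
K_{f,w}(x, x') \;=\; \nabla_w f_w(x) \cdot \nabla_w f_w(x'),
\]
so two inputs are deemed similar to the extent that the architecture $f$ responds similarly to weight perturbations at both.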
This improved understanding should lead to better learning algorithms.