Transferring Inductive Biases through Knowledge Distillation
Having the right inductive biases can be crucial in many tasks or scenarios
where data or computing resources are a limiting factor, or where training data
is not perfectly representative of the conditions at test time. However,
defining, designing, and efficiently adapting inductive biases is not
necessarily straightforward. In this paper, we explore the power of knowledge
distillation for transferring the effect of inductive biases from one model to
another. We consider families of models with different inductive biases (LSTMs
vs. Transformers and CNNs vs. MLPs) in the context of tasks and scenarios where
having the right inductive biases is critical. We study how the effect of
inductive biases is transferred through knowledge distillation, in terms of not
only performance but also other properties of the converged solutions.
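For context, knowledge distillation typically trains the student on a weighted combination of the ground-truth loss and a soft-label term derived from the teacher's predictions. The following is a minimal sketch of that standard objective in the style of Hinton et al. (2015), not necessarily the exact loss used in this paper; the notation (teacher logits $z_t$, student logits $z_s$, temperature $\tau$, mixing weight $\lambda$) is ours:

\[
\mathcal{L}_{\mathrm{KD}} \;=\; (1-\lambda)\,\mathrm{CE}\big(y,\ \mathrm{softmax}(z_s)\big) \;+\; \lambda\,\tau^{2}\,\mathrm{KL}\big(\mathrm{softmax}(z_t/\tau)\ \big\|\ \mathrm{softmax}(z_s/\tau)\big),
\]

where $y$ is the ground-truth label and the $\tau^{2}$ factor keeps the gradient scale of the soft-label term comparable to that of the hard-label term.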