Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers
Large pretrained language models have shown surprising In-Context Learning
(ICL) ability. With a few demonstration input-label pairs, they can predict the
label for an unseen input without additional parameter updates. Despite its
great success in performance, the working mechanism of ICL remains an open
problem. To better understand how ICL works, this paper interprets language
models as meta-optimizers and ICL as a kind of implicit finetuning.
Theoretically, we show that Transformer attention has a dual form of
gradient-descent-based optimization. Building on this, we understand
ICL as follows: GPT first produces meta-gradients according to the
demonstration examples, and then these meta-gradients are applied to the
original GPT to build an ICL model. Experimentally, we comprehensively compare
the behavior of ICL and explicit finetuning on real tasks to provide empirical
evidence that supports our understanding. The results show that ICL behaves
similarly to explicit finetuning at the prediction level, the representation
level, and the attention-behavior level. Further, inspired by
our understanding of meta-optimization, we design a momentum-based attention by
analogy with the momentum-based gradient descent algorithm. Its consistently
better performance over vanilla attention supports our understanding from
another angle and, more importantly, shows the potential of applying this
understanding to future model design.
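To make the dual-form claim concrete, the following is a minimal sketch of the kind of identity the abstract appeals to, written for a relaxed (linear) attention layer. The notation is assumed here rather than fixed by the abstract: $X'$ denotes the demonstration tokens, $X$ the query-text tokens, $q$ the current query vector, and $W_K, W_V$ the key and value projections.
\begin{aligned}
F_{\mathrm{ICL}}(q)
  &\approx W_V \left[X'; X\right] \left(W_K \left[X'; X\right]\right)^{\top} q \\
  &= \underbrace{W_V X \left(W_K X\right)^{\top}}_{W_{\mathrm{ZSL}}} q
   \;+\; \underbrace{W_V X' \left(W_K X'\right)^{\top}}_{\Delta W_{\mathrm{ICL}}} q
   \;=\; \left(W_{\mathrm{ZSL}} + \Delta W_{\mathrm{ICL}}\right) q .
\end{aligned}
The demonstration-only term $\Delta W_{\mathrm{ICL}}$ plays the role of the update $\Delta W$ in the dual form of a linear layer trained by gradient descent, $F(x) = (W_0 + \Delta W)\, x$ with $\Delta W$ a sum of outer products of back-propagated errors and training inputs. This is the sense in which the attention output can be read as applying meta-gradients produced from the demonstrations to the original model.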
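Likewise, one plausible instantiation of the momentum-based attention (a sketch by analogy, not necessarily the exact formulation used in the experiments; $\eta$ is an assumed decay factor) augments the ordinary attention output with an exponential moving average of the past value vectors $v_i$, mirroring how momentum SGD replaces the current gradient with an exponential moving average of past gradients:
\begin{equation*}
\mathrm{MoAttn}(V, K, q_t) = \mathrm{Attn}(V, K, q_t) + \mathrm{EMA}(V),
\qquad
\mathrm{EMA}(V) = \sum_{i=1}^{t-1} \eta^{\,t-i}\, v_i .
\end{equation*}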