Improving language models by retrieving from trillions of tokens
Auto-regressive language models can be improved by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens.
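As a rough illustration of that retrieval step, the sketch below finds the database chunks closest to the embedding of the preceding input chunk. It is a minimal brute-force version, assuming embeddings come from a frozen encoder; the names `embed`, `db_embeddings`, and `db_chunks` are hypothetical, and the full system relies on an approximate nearest-neighbour index over a trillion-token database rather than the exhaustive search shown here.

```python
import numpy as np

def retrieve_neighbours(chunk_embedding, db_embeddings, db_chunks, k=2):
    """Return the k database chunks whose frozen embeddings are closest
    (in L2 distance) to the embedding of the preceding input chunk."""
    dists = np.linalg.norm(db_embeddings - chunk_embedding, axis=1)
    nearest = np.argsort(dists)[:k]
    return [db_chunks[i] for i in nearest]

# Hypothetical usage, where `embed` stands in for a frozen BERT-style encoder:
# query = embed(previous_chunk_tokens)
# neighbours = retrieve_neighbours(query, db_embeddings, db_chunks, k=2)
```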
We combine a frozen BERT retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training.
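The sketch below conveys the idea behind chunked cross-attention: each chunk of the decoder sequence attends only over the encoded neighbours retrieved for it, so the cost grows with the number of chunks rather than the full sequence length. This is a simplified single-head version under stated assumptions; it omits the learned query/key/value projections, multi-head structure, and the causal offset used in the paper, so it should be read as an illustration rather than the paper's mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def chunked_cross_attention(hidden, neighbour_enc, chunk_len):
    """Simplified single-head chunked cross-attention.

    hidden:        (seq_len, d)      decoder hidden states
    neighbour_enc: (n_chunks, r, d)  encoded retrieved neighbours per chunk
    Each chunk of the sequence attends only to the neighbours retrieved
    for it, keeping attention cost linear in the number of chunks.
    """
    seq_len, d = hidden.shape
    out = np.array(hidden)
    for c in range(neighbour_enc.shape[0]):
        start, end = c * chunk_len, min((c + 1) * chunk_len, seq_len)
        q = hidden[start:end]                  # queries from this chunk
        kv = neighbour_enc[c]                  # keys/values from its neighbours
        attn = softmax(q @ kv.T / np.sqrt(d))  # (chunk, r) attention weights
        out[start:end] = q + attn @ kv         # residual cross-attention update
    return out
```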
With a trillion-token database, our Retrieval-Enhanced Transformer (RETRO) achieves comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using fewer parameters.
After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering.
Authors
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock