Knowledge and Document Count in Large Language Models
Large Language Models Struggle to Learn Long-Tail Knowledge
Figure: Language models struggle to capture the long-tail of information on the web. The plot shows accuracy for the BLOOM model family on TriviaQA as a function of how many documents in each model's pre-training data are relevant to each question.
We study the relationship between the knowledge memorized by large language models and the information in their pre-training datasets.
We show that a language model's ability to answer a fact-based question relates to how many documents associated with that question were seen during pre-training.
We identify these relevant documents by entity linking pre-training datasets and counting documents that contain the same entities as a given question-answer pair (see the sketch below).
Our results demonstrate strong correlational and causal relationships between accuracy and relevant document count for numerous question answering datasets (e.g., TriviaQA), pre-training corpora (e.g., ROOTS), and model sizes (e.g., 176B parameters).
Moreover, we find that while larger models are better at learning long-tail knowledge, we estimate that today's models must be scaled by many orders of magnitude to reach competitive performance on questions with little support in the pre-training data.
Finally, we show that retrieval-augmentation can reduce the dependence on relevant document count, presenting a promising approach for capturing the long-tail.
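To make the document-counting step concrete, here is a minimal Python sketch, not the paper's actual pipeline: it assumes the pre-training corpus has already been entity-linked offline so that each document carries the set of entity IDs it mentions, and all names below (EntityLinkedDoc, count_relevant_docs) are illustrative.

```python
# Minimal sketch of the relevant-document counting described above.
# Assumption: the corpus has already been entity-linked offline, so each
# document carries the set of entity IDs it mentions. All names here
# (EntityLinkedDoc, count_relevant_docs) are illustrative, not the paper's code.
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityLinkedDoc:
    doc_id: str
    entities: frozenset  # entity IDs detected in the document


def count_relevant_docs(question_entities, answer_entities, corpus):
    """Count documents in which the question's entities and the answer's
    entities co-occur, i.e., documents relevant to the QA pair."""
    count = 0
    for doc in corpus:
        has_question_entity = any(e in doc.entities for e in question_entities)
        has_answer_entity = any(e in doc.entities for e in answer_entities)
        if has_question_entity and has_answer_entity:
            count += 1
    return count


# Toy usage with three entity-linked documents.
corpus = [
    EntityLinkedDoc("d1", frozenset({"George_Washington", "United_States"})),
    EntityLinkedDoc("d2", frozenset({"George_Washington", "1789"})),
    EntityLinkedDoc("d3", frozenset({"Paris", "France"})),
]
# "In what year did George Washington become U.S. president?" -> "1789"
print(count_relevant_docs({"George_Washington"}, {"1789"}, corpus))  # 1
```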
Authors
Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, Colin Raffel
Large language models (LLMs) trained on text from the internet capture many facts about the world, ranging from well-known factoids to esoteric domain-specific information.
Given the scale of today's pre-training datasets and LLMs, one would hope that they can learn a huge amount of information from web-sourced text.
However, not all of the knowledge on the internet appears equally often.
There is a long-tail of information that appears rarely or only once.
In this work, we explore the relationship between the knowledge learned by an LLM and the information in its pre-training dataset.
Specifically, we study how an LLM's ability to answer a question relates to how many documents associated with that question were seen during pre-training.
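As a hedged illustration of this kind of analysis (not the paper's code): once each question has a relevant-document count, questions can be grouped into log-spaced bins by that count and per-bin QA accuracy computed, which is how curves like the one in the figure above are typically produced. The data and helper names below are placeholders.

```python
# Hedged sketch of an accuracy-vs-document-count analysis; names and toy
# data are illustrative, not results from the paper.
import math
from collections import defaultdict


def accuracy_by_doc_count(examples, num_bins=8):
    """examples: iterable of (relevant_doc_count, answered_correctly) pairs.
    Returns {lower edge of log10 bin: accuracy within that bin}."""
    bins = defaultdict(list)
    for doc_count, correct in examples:
        # Log-scale bins, since document counts span many orders of magnitude.
        bin_idx = 0 if doc_count < 1 else min(int(math.log10(doc_count)), num_bins - 1)
        bins[bin_idx].append(correct)
    return {10 ** b: sum(hits) / len(hits) for b, hits in sorted(bins.items())}


# Toy usage: accuracy tends to rise with the number of relevant documents.
toy = [(3, False), (50, False), (500, True), (20_000, True), (150_000, True)]
print(accuracy_by_doc_count(toy))
# {1: 0.0, 10: 0.0, 100: 1.0, 10000: 1.0, 100000: 1.0}
```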
Result
Large language models (LLMs) demonstrate impressive few-shot learning capabilities that arise from simply training on large-scale internet text.
With the open-source release of llms and their associated pre-training datasets, the research community can now begin to understand the origins of these capabilities.
In our case, the results are negative: while LLMs achieve moderate performance on open-domain QA benchmarks, they are mainly successful on questions probing knowledge that appears widely in their pre-training data.
Our work raises numerous directions for further inquiry, namely, how to improve retention of long-tail knowledge given that simply scaling up model and dataset size will likely be insufficient.
Moreover, we focus on knowledge learning as it relates to factoid question answering, and leave open the question of whether similar relationships exist for other types of tasks, whether knowledge-intensive or otherwise.