Knowledge and Document Count in Large Language Models
Large Language Models Struggle to Learn Long-Tail Knowledge
We study the relationship between the knowledge memorized by large language models and the information in their pre-training datasets. We show that a language model's ability to answer a fact-based question relates to how many documents associated with that question were seen during pre-training. We identify these relevant documents by entity linking pre-training datasets and counting documents that contain the same entities as a given question-answer pair. Our results demonstrate strong correlational and causal relationships between accuracy and relevant document count for numerous question answering datasets (e.g., TriviaQA), pre-training corpora (e.g., ROOTS), and model sizes (e.g., 176B parameters). Moreover, we find that while larger models are better at learning long-tail knowledge, we estimate that today's models must be scaled by many orders of magnitude to reach competitive performance on questions with little support in the pre-training data. Finally, we show that retrieval-augmentation can reduce the dependence on relevant document count, presenting a promising approach for capturing the long-tail.
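The counting heuristic described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the corpus has already been entity-linked (each document mapped to the set of entity IDs it mentions), and the function name, entity IDs, and toy corpus are all hypothetical.

```python
def count_relevant_docs(doc_entities, qa_entities):
    """Count documents whose linked-entity set contains every salient
    entity of a question-answer pair (illustrative sketch)."""
    qa = set(qa_entities)
    return sum(1 for ents in doc_entities if qa <= set(ents))

# Toy entity-linked corpus: each inner list is one document's entity IDs.
corpus = [
    ["Q937", "Q11379"],           # mentions both QA entities
    ["Q937"],                     # mentions only one
    ["Q937", "Q11379", "Q30"],    # mentions both, plus others
]

# Hypothetical QA pair whose question and answer link to two entities.
qa_pair_entities = ["Q937", "Q11379"]

print(count_relevant_docs(corpus, qa_pair_entities))  # 2
```

Plotting QA accuracy against this per-question count is what produces the long-tail trend shown in the figure below.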
Language models struggle to capture the long-tail of information on the web. Above, we plot accuracy for the BLOOM model family on TriviaQA as a function of how many documents in the model’s pre-training data are relevant to each question.