Galactica: A Large Language Model for Science
We introduce a large language model that can store, combine and reason about scientific knowledge.
We train on a large scientific corpus of papers, reference material, knowledgebases and many other sources and outperform existing models on a range of scientific tasks.
We believe these results demonstrate the potential for language models as a new interface for science.
We open source the model for the benefit of the scientific community.
Authors
Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, Robert Stojnic
We introduce a new large language model called Galactica (GAL) for automatically organizing science.
Galactica is trained on a large and curated corpus of humanity's scientific knowledge.
This includes over 48 million papers, textbooks and lecture notes, millions of compounds and proteins, scientific websites, encyclopedias and more.
Unlike existing language models, which rely on an uncurated crawl-based paradigm, our corpus is high-quality and highly curated.
We are able to train on it for multiple epochs without overfitting, where upstream and downstream performance improves with use of repeated tokens.
We also include task-specific datasets in pre-training to facilitate composition of this knowledge into new task contexts.
Our ultimate vision is a single neural network for powering scientific tasks.
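The inclusion of task-specific datasets in pre-training can be illustrated with a minimal sketch that interleaves prompt-formatted task examples with ordinary corpus documents. The example documents, the build_training_stream helper and the mixing fraction below are illustrative assumptions, not the actual data pipeline or configuration used for Galactica.

import random

# Minimal sketch: mix prompt-formatted task examples into a pre-training stream.
# The documents, prompts and mixing fraction are illustrative assumptions.
corpus_docs = ["<curated paper text> ...", "<textbook chapter> ..."]
task_prompts = ["Question: What is the chemical symbol for gold?\n\nAnswer: Au"]

def build_training_stream(corpus, prompts, prompt_fraction=0.05, seed=0):
    """Interleave task prompts with corpus documents so the model sees task formats during pre-training."""
    rng = random.Random(seed)
    stream = []
    for doc in corpus:
        stream.append(doc)
        # Occasionally insert a prompt-formatted task example.
        if prompts and rng.random() < prompt_fraction:
            stream.append(rng.choice(prompts))
    return stream

training_stream = build_training_stream(corpus_docs, task_prompts)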
Results
We evaluate Galactica on the capabilities needed for a new interface for science: storing, combining and reasoning about scientific knowledge.
These correspond to the high-level design goals we set for the model.
We set up several knowledge probe benchmarks, building on the general approach used for probing knowledge in language models.
These were critical metrics during model development for identifying knowledge gaps within the corpus and for informing how to iterate on the corpus.
They also provide insight into the relative knowledge strengths of Galactica versus general language models, and we cover these results in this section before turning to the downstream tasks.
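As a rough illustration of how a cloze-style knowledge probe can be scored, the sketch below asks a model to complete short factual prompts and checks whether the expected answer appears in the completion. The probe examples, the generate callable and the substring-matching rule are assumptions made for illustration; they are not the benchmarks or scoring used in this work.

# Minimal sketch of scoring a knowledge probe; the probes and scoring rule are illustrative.
probes = [
    {"prompt": "The chemical symbol for gold is", "answer": "Au"},
    {"prompt": "DNA is primarily stored in the cell", "answer": "nucleus"},
]

def probe_accuracy(generate, probes):
    """Fraction of probes whose completion contains the expected answer string."""
    correct = 0
    for example in probes:
        completion = generate(example["prompt"])  # greedy completion from the model under test
        if example["answer"].lower() in completion.lower():
            correct += 1
    return correct / len(probes)

# Example usage with a stub model that always answers "Au".
print(probe_accuracy(lambda prompt: " Au", probes))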
We find that performance continues to improve on the validation set and on in-domain and out-of-domain benchmarks with multiple repeats of the corpus.
We note the implication that the "tokens → ∞" focus of current LLM projects may be overemphasised versus the importance of filtering the corpus for quality.
We see no signs of overfitting, suggesting that the use of repeated tokens improves downstream as well as upstream performance.
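To make the overfitting check concrete, the following sketch tracks validation loss across repeated epochs over a fixed corpus and reports the epoch at which the loss stops improving. The train_one_epoch and evaluate callables and the patience rule are hypothetical stand-ins, not the training or evaluation code used for Galactica.

def detect_overfitting(train_one_epoch, evaluate, n_epochs=4, patience=1):
    """Return the epoch at which validation loss stops improving, or None if it keeps improving."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(n_epochs):
        train_one_epoch()        # another pass over the same (repeated) tokens
        val_loss = evaluate()    # loss on a held-out validation set
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement > patience:
            return epoch         # validation loss has stopped improving: a sign of overfitting
    return None

# Example usage with stub callables that mimic a loss curve that flattens out.
losses = iter([2.0, 1.8, 1.7, 1.7, 1.75])
print(detect_overfitting(lambda: None, lambda: next(losses), n_epochs=5))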