The Stack: A Large Language Model Dataset for Code Generation
The Stack: 3 TB of permissively licensed source code
Figure: Histogram of the amount of data per programming language in the permissively licensed dataset (dataset size plotted on a log scale).
We introduce a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages.
We make the dataset available at this https URL, provide a tool called "Am I in The Stack" (this https URL) for developers to search for copies of their code, and provide a process for code to be removed from the dataset by following the instructions at this https URL.
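As a rough illustration of how a corpus of this size can be consumed without downloading all of it, the sketch below streams one language subset with the Hugging Face datasets library. The repository id bigcode/the-stack, the data/python subdirectory, and the content field are assumptions about the hosting layout rather than details stated above; substitute the actual URL from the paper.

```python
# Minimal sketch: stream a single language subset so the full 3 TB corpus
# is never downloaded. Repository id, data_dir, and field name are assumed.
from itertools import islice

from datasets import load_dataset

ds = load_dataset(
    "bigcode/the-stack",     # assumed Hub repository id
    data_dir="data/python",  # assumed per-language subdirectory
    split="train",
    streaming=True,          # iterate lazily instead of materializing the split
)

# Peek at the first few source files.
for example in islice(ds, 3):
    print(example["content"][:200])  # assumed field holding the file contents
```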
Authors
Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries
Large transformer models are pre-trained on large internet corpora and have shown impressive zero- and few-shot performance on numerous natural language processing tasks, often by prompting the model with a natural language description of the task at hand.
We argue that the research community would make progress faster if high-quality pre-training datasets, supported by data cards, were more broadly shared.
The dataset is a useful resource for developing competitive code-generation models: large transformer models trained on large collections of source code that enable the synthesis of programs from both natural language descriptions and other code snippets.
Such models can assist professional developers with programming tasks, for example, by auto-completing code snippets, generating docstrings for a given function signature and body, or suggesting unit tests for a codebase.
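To make the assistance scenario concrete, here is a minimal sketch of prompting a causal code model to complete a function from its signature and docstring. The model id my-org/code-model is a hypothetical placeholder; any code LLM trained on a corpus like this one could be substituted, and the transformers pipeline API is assumed as the interface.

```python
# Minimal sketch of the code-completion use case with a causal code model.
from transformers import pipeline

generator = pipeline("text-generation", model="my-org/code-model")  # hypothetical id

# Prompt with a signature and docstring; the model continues with the body.
prompt = '''def parse_csv_line(line: str) -> list[str]:
    """Split a CSV line into fields, handling quoted commas."""
'''
completion = generator(prompt, max_new_tokens=64, do_sample=False)
print(completion[0]["generated_text"])
```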
Results
We introduce a large dataset of more than 3 TB of permissively licensed source code for 30 common programming languages.
This paper describes the details of the dataset collection, presents a brief dataset analysis, and shows promising results on the HumanEval benchmark.
Our experimental results show that near-deduplication is an important pre-processing step for achieving competitive results on text2code benchmarks.
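For readers unfamiliar with near-deduplication, the following is a minimal sketch of one common approach, MinHash signatures with locality-sensitive hashing via the datasketch library. The shingle size and similarity threshold are illustrative assumptions, not the exact settings used to build this dataset.

```python
# Minimal sketch of near-deduplication with MinHash + LSH (datasketch).
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # number of hash permutations per signature

def minhash(text: str) -> MinHash:
    """Build a MinHash signature from character 5-gram shingles."""
    m = MinHash(num_perm=NUM_PERM)
    for i in range(len(text) - 4):
        m.update(text[i : i + 5].encode("utf-8"))
    return m

def near_deduplicate(files: list[str], threshold: float = 0.85) -> list[str]:
    """Keep one representative per cluster of near-duplicate files."""
    lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
    kept = []
    for idx, content in enumerate(files):
        sig = minhash(content)
        if lsh.query(sig):  # a near-duplicate has already been kept
            continue
        lsh.insert(str(idx), sig)
        kept.append(content)
    return kept
```

Exact-hash filtering removes only byte-identical files, whereas this kind of similarity-based filtering also catches lightly edited copies, which is why it matters for downstream benchmark performance.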