Prompt-based Multitask Benchmarking for Large Language Models
Multitask Prompted Training Enables Zero-Shot Task Generalization
Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks.
It has been hypothesized that this is a consequence of implicit multitask learning in language model training.
To test this hypothesis at scale, we develop a system for easily mapping general natural language tasks into a human-readable prompted form.
Using this system, we convert a large set of supervised datasets, each with multiple prompts written in diverse natural language.
These prompted datasets allow for benchmarking the ability of a model to perform completely unseen tasks specified in natural language.
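As an illustration of what a human-readable prompted form can look like, the sketch below applies a hypothetical NLI-style template to a single labeled example. The template wording, field names, and label mapping are illustrative assumptions, not the exact templates used in this work.

    # Hypothetical sketch: turning a structured NLI example into a natural-language
    # (input, target) pair. Template wording and field names are illustrative only.
    from string import Template

    nli_template = {
        # Natural-language question built from the example's fields.
        "input": Template('Suppose "$premise". Can we infer that "$hypothesis"? Yes, no, or maybe?'),
        # Map class indices to natural-language answers.
        "target": {0: "Yes", 1: "Maybe", 2: "No"},
    }

    example = {
        "premise": "The cat sat on the mat.",
        "hypothesis": "An animal is on the mat.",
        "label": 0,
    }

    input_text = nli_template["input"].substitute(
        premise=example["premise"], hypothesis=example["hypothesis"]
    )
    target_text = nli_template["target"][example["label"]]

    print(input_text)   # prompt fed to the model as plain text
    print(target_text)  # expected natural-language answer ("Yes")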
We fine-tune a pretrained encoder-decoder model on this multitask mixture covering a wide variety of tasks.
The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size.
Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size.
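As a rough illustration of the fine-tuning step described above, the sketch below trains a pretrained encoder-decoder on one prompted (input, target) pair using Hugging Face Transformers. The checkpoint name, example text, and hyperparameters are placeholder assumptions, not the configuration used in this work.

    # Minimal sketch (not this paper's training setup): one gradient step of
    # fine-tuning a pretrained encoder-decoder on a prompted text pair.
    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # Placeholder checkpoint; the actual work starts from a much larger model.
    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # One prompted example drawn from the multitask mixture.
    inputs = tokenizer(
        'Suppose "The cat sat on the mat." Can we infer that '
        '"An animal is on the mat."? Yes, no, or maybe?',
        return_tensors="pt",
    )
    labels = tokenizer("Yes", return_tensors="pt").input_ids

    # Standard sequence-to-sequence cross-entropy loss on the target text.
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()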
Authors
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta