This paper explores a simple method for improving the zero-shot learning
abilities of language models. We show that instruction tuning -- finetuning
language models on a collection of tasks described via instructions --
substantially boosts zero-shot performance on unseen tasks.
We take a 137B parameter pretrained language model and instruction-tune it on
over 60 NLP tasks verbalized via natural language instruction templates. We
evaluate this instruction-tuned model, which we call FLAN, on unseen task
types. FLAN substantially improves the performance of its unmodified
counterpart and surpasses zero-shot 175B GPT-3 on 19 of 25 tasks that we
evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE,
BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the
number of finetuning tasks and model scale are key to the success of
instruction tuning.
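
To make the recipe concrete, here is a minimal sketch of how a single dataset example might be verbalized via a natural language instruction template. The template wording, field names, and helper function below are illustrative assumptions, not the paper's actual templates.

```python
# A minimal sketch of verbalizing a task example with an instruction
# template, in the spirit of instruction tuning. The template wording
# and field names here are illustrative, not the paper's actual ones.

def verbalize(template: str, example: dict) -> str:
    """Fill a natural-language instruction template with example fields."""
    return template.format(**example)

# Hypothetical NLI example and template.
nli_template = (
    "Premise: {premise}\n"
    "Hypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? OPTIONS: yes, no, maybe"
)

example = {
    "premise": "The dog is chasing a ball in the park.",
    "hypothesis": "An animal is outdoors.",
}

prompt = verbalize(nli_template, example)
print(prompt)  # The verbalized prompt (with target "yes") becomes one finetuning instance.
```

In the paper's setup, multiple such templates are composed for each task so that the model does not overfit to any single phrasing.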
Authors
Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le