Existing pre-trained models are generally geared towards a particular class
of problems. To date, there is still no consensus on what the right
architecture and pre-training setup should be. This paper presents a unified
framework for pre-training models that are universally effective across
datasets and setups. We begin by disentangling architectural archetypes from
pre-training objectives, two concepts that are commonly conflated. Next, we
present a generalized and unified perspective for self-supervision in NLP and
show how different pre-training objectives can be cast as one another and how
interpolating between different objectives can be effective. We then propose
Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse
pre-training paradigms. We further introduce a notion of mode
switching, wherein downstream fine-tuning is associated with specific
pre-training schemes. We conduct extensive ablative experiments to compare
multiple pre-training objectives and find that our method pushes the
Pareto frontier by outperforming T5 and/or GPT-like models across multiple
diverse setups. Finally, by scaling our model up to 20B parameters, we achieve
SOTA performance on 50 well-established supervised NLP tasks ranging from
language generation (with automated and human evaluation), language
understanding, text classification, question answering, commonsense reasoning,
long text reasoning, and structured knowledge grounding to information retrieval.
Our model also achieves strong results in in-context learning, outperforming
175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on
one-shot summarization. We release Flax-based T5X model checkpoints for the 20B
model at
\url{this https URL}.
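
The abstract only names Mixture-of-Denoisers and mode switching without describing their mechanics; the Python sketch below is a minimal, illustrative rendering of both ideas, not the released Flax/T5X implementation. The mode tokens [R], [S], [X], the corruption settings, and the helpers corrupt and make_example are assumptions introduced here purely for illustration.

    import random

    # Illustrative mixture of denoising configurations. Each entry pairs a mode
    # token with a (toy) corruption recipe; the specific rates are assumptions.
    DENOISERS = {
        "[R]": {"corruption_rate": 0.15},  # regular span corruption (moderate)
        "[S]": {"prefix_lm": True},        # sequential denoising: predict a suffix from a prefix
        "[X]": {"corruption_rate": 0.50},  # extreme denoising: much heavier corruption
    }

    def corrupt(tokens, cfg):
        """Toy corruption: mask the suffix (prefix-LM) or one contiguous span."""
        if cfg.get("prefix_lm"):
            split = len(tokens) // 2
            return tokens[:split] + ["<mask>"], tokens[split:]
        n = max(1, int(len(tokens) * cfg["corruption_rate"]))
        start = random.randrange(0, len(tokens) - n + 1)
        return tokens[:start] + ["<mask>"] + tokens[start + n:], tokens[start:start + n]

    def make_example(tokens):
        """Sample one denoiser from the mixture and prepend its mode token."""
        mode, cfg = random.choice(list(DENOISERS.items()))
        inputs, targets = corrupt(tokens, cfg)
        return [mode] + inputs, targets

    if __name__ == "__main__":
        random.seed(0)
        print(make_example("the quick brown fox jumps over the lazy dog".split()))

The point the sketch tries to convey is that every training example carries a mode token identifying the denoising paradigm it came from, which is what later allows downstream fine-tuning or inference to be associated with a specific pre-training scheme.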
Authors: Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Neil Houlsby, Donald Metzler