Existing language models predict tokens with a softmax over a finite vocabulary, which can make it difficult to predict rare tokens or phrases.
We introduce the first nonparametric masked language model (NPM) that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus.
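As a rough illustration of what replacing the softmax means, here is a minimal sketch, assuming the masked position and every corpus phrase have already been encoded into a shared vector space; the function and variable names are hypothetical, and the dot-product similarity is an illustrative choice rather than the paper's exact scoring function.

```python
import torch

def nonparametric_phrase_distribution(query_vec, phrase_vecs, temperature=1.0):
    """Score a masked position against every phrase in a reference corpus.

    query_vec:   (d,) encoding of the [MASK] position (hypothetical name)
    phrase_vecs: (num_phrases, d) precomputed encodings of corpus phrases
    Returns a probability distribution over corpus phrases instead of a
    softmax over a fixed output vocabulary.
    """
    scores = phrase_vecs @ query_vec  # similarity to each corpus phrase
    return torch.softmax(scores / temperature, dim=0)

# Toy usage: the predicted fill is the highest-scoring corpus phrase.
phrases = ["Seattle", "the Space Needle", "Thessaloniki"]
phrase_vecs = torch.randn(len(phrases), 16)
query_vec = torch.randn(16)
probs = nonparametric_phrase_distribution(query_vec, phrase_vecs)
print(phrases[int(probs.argmax())])
```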
We show that NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval.
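A minimal sketch of how such an in-batch approximation could look, assuming each masked position in a batch has exactly one gold phrase among the phrases drawn from the same batch; the names and the cross-entropy formulation are assumptions for illustration, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_vecs, phrase_vecs, positive_idx):
    """Contrast each masked position against in-batch phrases only,
    approximating retrieval over the full corpus.

    query_vecs:   (num_masks, d) encodings of masked positions
    phrase_vecs:  (num_phrases, d) encodings of phrases from the same batch
    positive_idx: (num_masks,) index of the gold phrase for each mask
    """
    logits = query_vecs @ phrase_vecs.T  # (num_masks, num_phrases)
    # Cross-entropy pulls each mask toward its gold phrase and pushes it
    # away from the other in-batch phrases, which act as negatives.
    return F.cross_entropy(logits, positive_idx)
```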
Zero-shot evaluation on 9 closed-set tasks and 7 open-set tasks demonstrates that NPM outperforms significantly larger parametric models, either with or without a retrieve-and-generate approach.
It is particularly effective at dealing with rare patterns (word senses or facts) and at predicting rare or nearly unseen words (e.g., non-Latin script).
Authors
Sewon Min, Weijia Shi, Mike Lewis, Xilun Chen, Wen-tau Yih, Hannaneh Hajishirzi, Luke Zettlemoyer
Large language models, despite their wide use and impressive performance, are expensive to scale and difficult to update, and they struggle with long-tail knowledge and patterns.
In this paper, we introduce NPM, the first nonparametric masked language model that predicts tokens solely based on a nonparametric distribution over every phrase in a text corpus.
NPM can predict extremely rare and even completely unseen words, and it supports effectively unlimited vocabulary sizes.
NPM is significantly more parameter-efficient, outperforming parametric models up to 500x larger and retrieve-and-generate models up to 37x larger.
Result
We introduce NPM, a nonparametric masked language model that replaces the softmax over the output vocabulary with a nonparametric distribution over every phrase in a reference corpus.
NPM can be trained efficiently using a contrastive objective and an in-batch approximation to full corpus retrieval.
Extensive zero-shot evaluation on 9 closed-set tasks and 7 open-set tasks shows that NPM outperforms significantly larger parametric models, with particular strengths in handling rare patterns (word senses or facts), scaling and updating the reference corpus at test time, and predicting extremely rare, if not unseen, characters in non-Latin languages.