Figure: Overview of the CDCD framework.
We propose CDCD, a framework for modelling categorical data with diffusion models that are continuous both in time and input space.
We demonstrate its efficacy on several language modelling tasks.
Authors
Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H. Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, Curtis Hawthorne, Rémi Leblond, Will Grathwohl, Jonas Adler
Diffusion models have rapidly grown in scale and capability over the past few years, across many modalities, including images, audio, video and text.
In language modelling, the focus has been on scaling up autoregressive models and expanding their capabilities, spurred by the development of the transformer architecture. This has resulted in general-purpose language models that are suitable for practical use.
Diffusion-based language models have seen relatively little success so far.
This is in part due to the discrete, categorical nature of textual representations of language, which standard diffusion models are ill-equipped to handle.
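As a rough illustration of how continuous diffusion can nonetheless be applied to discrete tokens, the sketch below embeds token indices into a continuous space and adds Gaussian noise whose scale varies continuously with time. The embedding table, the dimensions and the sigma(t) = t schedule are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EMBED_DIM = 1000, 256  # illustrative sizes

# Hypothetical embedding table; in CDCD the embeddings are learned
# jointly with the denoiser.
E = rng.normal(size=(VOCAB_SIZE, EMBED_DIM)).astype(np.float32)
E /= np.linalg.norm(E, axis=1, keepdims=True)  # normalise embeddings

def forward_diffuse(token_ids, t):
    """Map discrete tokens to continuous embeddings, then add Gaussian
    noise whose scale grows continuously with time t (assumed
    variance-exploding schedule sigma(t) = t)."""
    x0 = E[token_ids]                                    # (seq_len, embed_dim)
    noise = rng.normal(size=x0.shape).astype(np.float32)
    return x0 + t * noise                                # continuous in time and space

tokens = rng.integers(0, VOCAB_SIZE, size=16)            # a toy "sentence"
x_noisy = forward_diffuse(tokens, t=0.7)                 # any real t >= 0 works
print(x_noisy.shape)                                     # (16, 256)
```

Because the noisy input lives in embedding space rather than on the discrete vocabulary, the usual continuous-time diffusion machinery applies unchanged.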
As a result, several diffusion-inspired approaches to language modelling have recently been proposed, but these depart from the diffusion framework used for perceptual data in several important ways (with a few exceptions).
Departing from that framework usually means giving up some of this model class's unique capabilities, such as classifier-free guidance to enhance conditional generation (sketched below), which has been instrumental to the success of diffusion-based text-conditional image generators.
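For concreteness, classifier-free guidance works as follows: a single model is trained both with and without conditioning information, and at sampling time its conditional and unconditional predictions are extrapolated away from each other. The sketch below is a minimal, hypothetical illustration; the function names and the toy denoiser are placeholders rather than any real API.

```python
import numpy as np

def guided_denoise(denoise_fn, x_t, t, cond, w=2.0):
    """Classifier-free guidance: push the prediction away from the
    unconditional estimate, towards the conditional one. A guidance
    weight w > 1 strengthens the conditioning signal."""
    pred_cond = denoise_fn(x_t, t, cond)
    pred_uncond = denoise_fn(x_t, t, None)   # None = null conditioning
    return pred_uncond + w * (pred_cond - pred_uncond)

# Toy stand-in for a trained denoiser, just to make the sketch runnable.
def toy_denoiser(x_t, t, cond):
    return 0.9 * x_t + (0.0 if cond is None else 1.0)

x_t = np.zeros(4)
print(guided_denoise(toy_denoiser, x_t, t=0.5, cond="a prompt"))  # [2. 2. 2. 2.]
```

Setting w = 1 recovers the plain conditional prediction; w > 1 amplifies the effect of the conditioning, which is what has made guidance so effective for text-conditional image generation.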