A New Class of Latent Diffusion Models Based on Transformer Architectures
Scalable Diffusion Models with Transformers
We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches.
We explore a new class of diffusion models based on the transformer architecture.
In addition to possessing good scalability properties, our largest models outperform all prior diffusion models on the class-conditional ImageNet 512×512 and 256×256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.
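To make "a transformer that operates on latent patches" concrete, below is a minimal PyTorch sketch of the tokenization step: a latent feature map is split into non-overlapping patches, each linearly embedded as one transformer token. The class name, shapes, and hyperparameters are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class PatchEmbed(nn.Module):
        """Split a latent feature map into non-overlapping patches and
        linearly embed each patch as one transformer token."""
        def __init__(self, patch_size=2, in_channels=4, dim=384):
            super().__init__()
            # A strided convolution is an efficient patchify + linear projection.
            self.proj = nn.Conv2d(in_channels, dim,
                                  kernel_size=patch_size, stride=patch_size)

        def forward(self, z):
            # z: (batch, channels, H, W) latent, e.g. from a pretrained VAE encoder
            x = self.proj(z)                     # (batch, dim, H/p, W/p)
            return x.flatten(2).transpose(1, 2)  # (batch, num_tokens, dim)

    # e.g. a 4x32x32 latent with patch size 2 yields a sequence of 256 tokens
    tokens = PatchEmbed()(torch.randn(1, 4, 32, 32))
    print(tokens.shape)  # torch.Size([1, 256, 384])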
Diffusion models have been at the forefront of recent advances in image-level generative models, yet they all adopt a convolutional U-Net architecture as the de-facto choice of backbone.
In this paper, we demystify the significance of architectural choices in diffusion models and offer empirical baselines for future generative modeling research.
We show that the U-Net inductive bias is not crucial to the performance of diffusion models, and that U-Nets can be readily replaced with standard designs such as transformers.
As a result, diffusion models are well-poised to benefit from the recent trend of architecture unification by inheriting best practices and training recipes from other domains, as well as retaining favorable properties like scalability, robustness and efficiency.
Results
Diffusion models are an important class of image generative models that have been extensively studied in the literature.
We introduce a simple transformer-based backbone for diffusion models that outperforms prior U-Net models and inherits the excellent scaling properties of the transformer model class.
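The sketch below shows what such a transformer-based denoising backbone could look like end to end: patchify a noisy latent, add timestep and class-label conditioning, run standard transformer blocks, and unpatchify into a latent-shaped noise prediction. All names, layer choices, and the simple additive conditioning are our own simplifications for illustration; the paper's actual design differs in details such as the conditioning mechanism.

    import torch
    import torch.nn as nn

    class LatentTransformerBackbone(nn.Module):
        """Minimal transformer denoiser over latent patch tokens (illustrative only)."""
        def __init__(self, patch_size=2, in_channels=4, dim=384,
                     depth=12, heads=6, num_classes=1000):
            super().__init__()
            self.p, self.c = patch_size, in_channels
            self.patch_embed = nn.Conv2d(in_channels, dim, patch_size, patch_size)
            # Learned positional embeddings; sized here for a 32x32 latent with p=2.
            self.pos_embed = nn.Parameter(torch.zeros(1, 256, dim))
            self.t_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
            self.y_embed = nn.Embedding(num_classes, dim)
            layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                               batch_first=True, norm_first=True)
            self.blocks = nn.TransformerEncoder(layer, depth)
            self.head = nn.Linear(dim, patch_size * patch_size * in_channels)

        def forward(self, z, t, y):
            # z: noisy latent (B, C, H, W); t: timestep (B,); y: class label (B,)
            B, _, H, W = z.shape
            x = self.patch_embed(z).flatten(2).transpose(1, 2) + self.pos_embed
            cond = (self.t_embed(t[:, None].float()) + self.y_embed(y))[:, None]
            x = self.blocks(x + cond)  # conditioning added to every token
            x = self.head(x)           # (B, num_tokens, p*p*C)
            # Unpatchify back to a latent-shaped noise prediction.
            h, w = H // self.p, W // self.p
            x = x.reshape(B, h, w, self.p, self.p, self.c)
            return x.permute(0, 5, 1, 3, 2, 4).reshape(B, self.c, H, W)

    eps = LatentTransformerBackbone()(torch.randn(2, 4, 32, 32),
                                      torch.tensor([10, 500]),
                                      torch.tensor([3, 7]))
    print(eps.shape)  # torch.Size([2, 4, 32, 32])

Because the backbone is a stack of standard transformer blocks over a token sequence, scaling it follows the usual transformer recipe: increase depth, width, or the number of tokens (smaller patches), which is the scaling behavior the paper studies.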