eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Large-scale text-to-image diffusion models have led to breakthroughs in text-conditioned high-resolution image synthesis. Such models gradually synthesize images in an iterative fashion while conditioning on text prompts. However, we find that their synthesis behavior qualitatively changes throughout this process: early in sampling, generation strongly relies on the text prompt to produce text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, we propose to train an ensemble of text-to-image diffusion models specialized for different stages of synthesis. Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark.
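The core idea of the staged ensemble can be sketched as follows: the diffusion time range is partitioned into intervals, each handled by its own "expert" denoiser, and the sampling loop switches experts as the noise level decreases. This is a minimal illustrative sketch only; the class and function names (`Expert`, `pick_expert`, `sample`) are assumptions for exposition, not the paper's actual implementation or API.

```python
class Expert:
    """Stand-in for a denoiser specialized to one interval of noise levels.

    A real expert would be a text-conditioned U-Net; here the update is a
    placeholder so the routing logic can be shown on its own.
    """
    def __init__(self, name):
        self.name = name

    def denoise(self, x, t, prompt):
        # Placeholder update standing in for one reverse-diffusion step.
        return [v * 0.9 for v in x]


def pick_expert(experts, t, t_max):
    """Route timestep t to the expert owning its interval of [0, t_max].

    experts[0] covers the lowest-noise interval (t near 0), the last
    expert covers the highest-noise interval (t near t_max).
    """
    idx = min(int(t / t_max * len(experts)), len(experts) - 1)
    return experts[idx]


def sample(experts, x, steps, prompt, t_max=1.0):
    """Iterate from high noise (t = t_max) down to t near 0, switching
    experts as t crosses interval boundaries. Because exactly one expert
    runs per step, the per-step inference cost matches a single model."""
    for i in reversed(range(steps)):
        t = (i + 1) / steps * t_max
        expert = pick_expert(experts, t, t_max)
        x = expert.denoise(x, t, prompt)
    return x
```

The routing mirrors the observation in the abstract: early, high-noise steps (large `t`) go to an expert that can specialize in following the text prompt, while late, low-noise steps go to an expert that can specialize in visual detail, without increasing the number of denoiser evaluations per sample.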