ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts
Recent progress in diffusion models has significantly advanced text-to-image generation. While existing approaches can produce photorealistic, high-resolution images from text prompts, several open problems remain that limit further improvement of image fidelity and text relevance. In this paper, we propose ERNIE-ViLG 2.0, a large-scale Chinese text-to-image diffusion model that progressively improves the quality of generated images by: (1) incorporating fine-grained textual and visual knowledge of key elements in the scene, and (2) employing different denoising experts at different stages of the denoising process. With the proposed mechanisms,
ERNIE-ViLG 2.0 not only achieves state-of-the-art performance on MS-COCO with a zero-shot FID score of 6.75, but also significantly outperforms recent models in image fidelity and image-text alignment, as judged by side-by-side human evaluation on the bilingual prompt set ViLG-300.
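To make the mixture-of-denoising-experts mechanism concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the diffusion timesteps are split into contiguous stages and each stage is denoised by its own expert network. The class names \texttt{UNetExpert} and \texttt{MixtureOfDenoisingExperts}, the uniform stage split, and the placeholder convolutional denoiser are all assumptions for illustration.

\begin{verbatim}
import torch
import torch.nn as nn

class UNetExpert(nn.Module):
    """Placeholder denoiser; a real expert would be a
    text-conditioned U-Net predicting the noise in x_t."""
    def __init__(self, channels: int = 4):
        super().__init__()
        self.net = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(x_t)

class MixtureOfDenoisingExperts(nn.Module):
    """Routes each sample to the expert that owns the stage
    containing its diffusion timestep, so early (high-noise) and
    late (low-noise) steps are handled by different parameters."""
    def __init__(self, num_experts: int = 10, num_timesteps: int = 1000):
        super().__init__()
        self.experts = nn.ModuleList(UNetExpert() for _ in range(num_experts))
        self.num_experts = num_experts
        self.num_timesteps = num_timesteps

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Map timestep t in [0, num_timesteps) to a stage index.
        stage = (t * self.num_experts // self.num_timesteps)
        stage = stage.clamp(max=self.num_experts - 1)
        out = torch.empty_like(x_t)
        for i in range(self.num_experts):
            mask = stage == i
            if mask.any():
                out[mask] = self.experts[i](x_t[mask], t[mask])
        return out

# Usage: a batch of noisy latents with per-sample timesteps.
mode = MixtureOfDenoisingExperts()
x = torch.randn(8, 4, 32, 32)
t = torch.randint(0, 1000, (8,))
predicted_noise = mode(x, t)
\end{verbatim}

Because each expert only ever sees its own slice of the noise schedule, the model's total capacity grows with the number of experts while the per-step inference cost stays that of a single denoiser.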