Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
Recent advances in diffusion models have set an impressive milestone in many
generation tasks. Trending works such as DALL-E 2, Imagen, and Stable
Diffusion have attracted great interest in academia and industry. Despite the
rapidly changing landscape, new approaches have focused on extensions and
performance rather than capacity, and thus require separate models for
separate tasks. In this work, we expand the existing single-flow diffusion pipeline into
a multi-flow network, dubbed Versatile Diffusion (VD), that handles
text-to-image, image-to-text, image-variation, and text-variation in one
unified model. Moreover, we generalize VD to a unified multi-flow multimodal
diffusion framework with grouped layers, swappable streams, and other
proposed designs that can process modalities beyond images and text. Through our
experiments, we demonstrate that VD and its underlying framework have the
following merits: a) VD handles all subtasks with competitive quality; b) VD
initiates novel extensions and applications such as disentanglement of style
and semantics, image-text dual-guided generation, etc.; c) Through these
experiments and applications, VD provides more semantic insight into the
generated outputs. Our code and models are open-sourced at
this https URL
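To make the multi-flow idea concrete, the sketch below shows one way a shared set of grouped layers could be reused across flows while modality-specific data and context streams remain swappable. This is a minimal illustration under our own assumptions: the module names (e.g. MultiFlowBlock), the toy dimensions, and the routing logic are hypothetical and are not taken from the VD codebase, which builds on a full diffusion U-Net rather than the simple linear layers used here.

```python
# Minimal sketch of a multi-flow block (hypothetical, not the actual VD code).
# Idea: one shared group of layers serves every flow, while the data stream
# (output modality) and context stream (conditioning modality) are swappable.
import torch
import torch.nn as nn


class MultiFlowBlock(nn.Module):
    def __init__(self, dim=64, ctx_dim=32):
        super().__init__()
        # Shared grouped layers: reused by all flows (text-to-image, image-variation, ...).
        self.shared = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Swappable data streams, one head per output modality.
        self.data_streams = nn.ModuleDict({
            "image": nn.Linear(dim, dim),
            "text": nn.Linear(dim, dim),
        })
        # Swappable context streams, one projector per conditioning modality.
        self.ctx_streams = nn.ModuleDict({
            "image": nn.Linear(ctx_dim, dim),
            "text": nn.Linear(ctx_dim, dim),
        })

    def forward(self, x, context, data_flow, ctx_flow):
        # Route through the chosen context stream, the shared backbone,
        # and the chosen data stream.
        h = x + self.ctx_streams[ctx_flow](context)
        h = self.shared(h)
        return self.data_streams[data_flow](h)


if __name__ == "__main__":
    block = MultiFlowBlock()
    x = torch.randn(2, 64)    # noised representation of the target modality
    ctx = torch.randn(2, 32)  # conditioning embedding
    # Text-to-image: image data stream conditioned via the text context stream.
    t2i = block(x, ctx, data_flow="image", ctx_flow="text")
    # Image-variation: image data stream conditioned via the image context stream.
    var = block(x, ctx, data_flow="image", ctx_flow="image")
    print(t2i.shape, var.shape)
```

Sharing the backbone across flows is what allows a single model to serve all four tasks; the sketch only illustrates this routing idea, not the actual network layers or training procedure.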