M-VADER: A Model for Diffusion with Multimodal Context

M-VADER: image synthesis with multimodal context. The guidance prompt, composed of interleaved images and text, is embedded using a multimodal decoder, S-MAGMA. The output of S-MAGMA conditions the generation process of a finetuned version of Stable Diffusion via cross-attention, allowing the model to convert the starting image (random noise) into the output image shown.
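As a rough illustration of this conditioning mechanism, the sketch below shows how a sequence of multimodal context embeddings (such as the S-MAGMA output) can steer a denoiser through cross-attention, with the noisy image latents as queries and the prompt embeddings as keys and values. This is a minimal toy example, not the authors' implementation; all names, dimensions, and the module structure are illustrative assumptions.

```python
# Minimal sketch of cross-attention conditioning (illustrative, not the paper's code).
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Image latents attend to the multimodal context sequence."""
    def __init__(self, latent_dim: int, context_dim: int, n_heads: int = 8):
        super().__init__()
        # kdim/vdim let the context live in a different embedding space
        # than the image latents, as with a separate prompt encoder.
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=n_heads,
            kdim=context_dim, vdim=context_dim, batch_first=True)
        self.norm = nn.LayerNorm(latent_dim)

    def forward(self, latents: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # latents: (batch, num_patches, latent_dim) -- noisy image tokens
        # context: (batch, seq_len, context_dim)   -- embedded image/text prompt
        attended, _ = self.attn(query=latents, key=context, value=context)
        return self.norm(latents + attended)  # residual connection

# Toy usage: 64 latent tokens attend to a 77-token multimodal prompt.
block = CrossAttentionBlock(latent_dim=320, context_dim=768)
latents = torch.randn(1, 64, 320)
context = torch.randn(1, 77, 768)
print(block(latents, context).shape)  # torch.Size([1, 64, 320])
```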
Diffusion models have been introduced that make it possible to specify the output image using a text prompt. Inspired by the success of those models, and led by the notion that language was already developed to describe the elements of visual contexts that humans find most important, we introduce an embedding model closely related to a vision-language model. Specifically, we introduce the embedding model S-MAGMA: a 13-billion-parameter multimodal decoder combining components from the autoregressive vision-language model MAGMA with biases finetuned for semantic search.
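One way to read "biases finetuned for semantic search" is a bias-only adaptation scheme (in the style of BitFit), where every weight matrix stays frozen and only the bias vectors are trained on the search objective. The sketch below shows this pattern under that assumption; the small decoder stack merely stands in for the 13-billion-parameter model, and the function name is hypothetical.

```python
# Minimal sketch of bias-only finetuning (an assumption, BitFit-style).
import torch.nn as nn

def freeze_all_but_biases(model: nn.Module) -> None:
    """Keep all weights frozen; mark only bias parameters as trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")

# Toy usage on a small transformer standing in for the 13B decoder.
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2)
freeze_all_but_biases(decoder)
trainable = [n for n, p in decoder.named_parameters() if p.requires_grad]
print(trainable)  # only parameters whose names end in 'bias' remain trainable
```

Training only the biases touches a tiny fraction of the parameters, which keeps the adapted model close to the original MAGMA decoder while still specializing its embeddings for retrieval-style objectives.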