Generating and editing images from text prompts using multimodal encoders
VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance
Generating and editing images from open-domain text prompts is a challenging task that heretofore has required expensive and specially trained models. We demonstrate a novel methodology for both tasks, capable of producing images of high visual quality from text prompts of significant semantic complexity without any training, by using a multimodal encoder to guide image generations. We demonstrate on a variety of tasks that using a multimodal encoder to guide VQGAN [11] produces outputs of higher visual quality than prior, less flexible approaches such as DALL-E [38], GLIDE [33], and Open-Edit [24], despite not being trained for the tasks presented.
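The core idea of encoder-guided generation can be sketched as an optimization loop: a latent vector is decoded into an image, re-encoded into a shared text-image embedding space, and iteratively updated so its embedding moves toward the embedding of the text prompt. The toy sketch below illustrates this loop with small random linear maps standing in for the real VQGAN decoder and CLIP encoders (the matrices `W_dec`, `W_enc` and the vector `text_emb` are hypothetical stand-ins, not the actual models):

```python
import numpy as np

# Toy sketch of multimodal-encoder-guided generation. All components
# here are hypothetical stand-ins, NOT the real VQGAN or CLIP models:
# a latent z is "decoded" to an image, re-encoded into an embedding
# space, and nudged toward a target text embedding by gradient descent.

rng = np.random.default_rng(0)
LATENT, IMAGE, EMBED = 8, 16, 4

W_dec = rng.normal(size=(IMAGE, LATENT))   # stand-in decoder (VQGAN's role)
W_enc = rng.normal(size=(EMBED, IMAGE))    # stand-in image encoder (CLIP's role)
text_emb = rng.normal(size=EMBED)          # stand-in text-prompt embedding

def loss(z):
    """Negative cosine similarity between the re-encoded image and the text."""
    emb = W_enc @ (W_dec @ z)
    denom = np.linalg.norm(emb) * np.linalg.norm(text_emb) + 1e-8
    return -float(emb @ text_emb / denom)

def grad(z, eps=1e-5):
    """Finite-difference gradient of the loss w.r.t. the latent z."""
    g = np.zeros_like(z)
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = eps
        g[i] = (loss(z + dz) - loss(z - dz)) / (2 * eps)
    return g

z = rng.normal(size=LATENT)
initial_loss = loss(z)
for _ in range(200):          # iterative latent refinement
    z -= 0.1 * grad(z)
final_loss = loss(z)
```

In the actual method, the same loop runs with automatic differentiation through the real decoder and encoder, with image augmentations applied before encoding; this toy version only shows that guiding a latent by an embedding-space similarity requires no task-specific training.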