Generating and editing images from text prompts using multimodal encoders - 42Papers