Paella: A Speed-Painting Text-to-Image Model
Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces
We introduce a novel text-to-image model that requires fewer than 10 steps to sample high-fidelity images, using a speed-optimized architecture that samples a single image in less than 500 ms while having 573M parameters. The model operates on a compressed and quantized latent space, is conditioned on CLIP embeddings, and uses an improved sampling function compared with previous works. Aside from text-conditional image generation, our model is able to do latent space interpolation and image manipulations such as inpainting, outpainting, and structural editing.
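To make the idea of few-step sampling on a quantized latent space concrete, the following is a minimal toy sketch, not the model's actual sampling function. It assumes a generic scheme in which a denoiser predicts token indices for every latent position and a shrinking fraction of those tokens is replaced by random codebook indices before the next step. All names and values (`toy_denoiser`, `renoise`, `sample`, `CODEBOOK_SIZE`, the linear schedule) are hypothetical and chosen for illustration only.

```python
import random

CODEBOOK_SIZE = 16  # assumed toy codebook size
NUM_STEPS = 8       # illustrative; the abstract only states "fewer than 10 steps"


def toy_denoiser(tokens, target):
    # Hypothetical stand-in for a trained text-conditional network.
    # A real model would predict a distribution over codebook entries
    # from the noisy tokens and the text embedding; here we simply
    # return the target tokens so the loop is easy to follow.
    return list(target)


def renoise(tokens, noise_ratio, rng):
    # Replace each token by a uniformly random codebook index
    # with probability noise_ratio; keep it otherwise.
    return [
        rng.randrange(CODEBOOK_SIZE) if rng.random() < noise_ratio else t
        for t in tokens
    ]


def sample(target, num_steps=NUM_STEPS, seed=0):
    rng = random.Random(seed)
    # Start from pure noise: random indices at every latent position.
    tokens = [rng.randrange(CODEBOOK_SIZE) for _ in range(len(target))]
    for step in range(num_steps):
        predicted = toy_denoiser(tokens, target)
        # Assumed linear schedule: the noise ratio reaches 0 at the last step.
        noise_ratio = 1.0 - (step + 1) / num_steps
        tokens = renoise(predicted, noise_ratio, rng)
    return tokens


if __name__ == "__main__":
    target = [i % CODEBOOK_SIZE for i in range(64)]  # e.g. a flattened 8x8 latent grid
    print(sample(target) == target)
```

In a real system the resulting token indices would be mapped back to pixels by the decoder of the vector-quantized autoencoder; that stage is omitted here.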