Sketch-Guided Text-to-Image Diffusion Models
Text-to-image models have introduced a remarkable leap in the evolution of
machine learning, demonstrating high-quality synthesis of images from a given
text prompt. However, these powerful pretrained models still lack control
handles that can guide spatial properties of the synthesized images. In this
work, we introduce a universal approach to guide a pretrained text-to-image
diffusion model with a spatial map from another domain (e.g., a sketch) at
inference time. Unlike previous works, our method does not require training a
dedicated model or a specialized encoder for the task. Our key idea is to train
a Latent Guidance Predictor (LGP) - a small, per-pixel Multi-Layer Perceptron
(MLP) that maps latent features of noisy images to spatial maps, where the deep
features are extracted from the core Denoising Diffusion Probabilistic Model
(DDPM) network. The LGP is trained only on a few thousand images and
constitutes a differentiable guiding map predictor, over which the loss is
computed and propagated back to push the intermediate images to agree with the
spatial map. The per-pixel training offers flexibility and locality, which
allow the technique to perform well on out-of-domain sketches, including
free-hand style drawings. We focus in particular on the sketch-to-image
translation task, revealing a robust and expressive way to generate images that
follow the guidance of a sketch of arbitrary style or domain. Project page:
sketch-guided-diffusion.github.io
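
The following is a minimal sketch (not the authors' released code) of what a per-pixel Latent Guidance Predictor could look like: a small MLP applied independently at every spatial location, mapping concatenated denoiser activations of a noisy image to a one-channel edge/sketch map. The feature dimension, layer widths, and the MSE training objective are assumptions for illustration.

```python
import torch
import torch.nn as nn


class LatentGuidancePredictor(nn.Module):
    """Per-pixel MLP over concatenated DDPM features (illustrative sketch)."""

    def __init__(self, feat_dim: int = 2048, hidden: int = 512):
        super().__init__()
        # 1x1 convolutions are equivalent to an MLP shared across all pixels.
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim, hidden, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(hidden, 1, kernel_size=1),  # per-pixel edge prediction
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (B, feat_dim, H, W) -- denoiser activations from a noisy
        # image, upsampled to a common resolution and channel-concatenated.
        return self.net(features)


def lgp_train_step(lgp, optimizer, features, target_sketch):
    """One supervised step: regress the sketch map from noisy-image features."""
    pred = lgp(features)
    loss = nn.functional.mse_loss(pred, target_sketch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Implementing the per-pixel MLP with 1x1 convolutions is only a convenience: it keeps the predictor tiny and location-agnostic, which is consistent with the locality property the abstract attributes to per-pixel training.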
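A second sketch, again under stated assumptions, illustrates the inference-time guidance the abstract describes: the loss between the LGP's prediction and the target spatial map is backpropagated to the intermediate noisy image, nudging it toward the sketch before the usual denoising update. Here `extract_features` and `denoise_step` are hypothetical placeholders for the DDPM feature hook and sampling update, and `guidance_scale` is an assumed hyperparameter.

```python
import torch


def guided_step(x_t, t, target_sketch, lgp, extract_features, denoise_step,
                guidance_scale=1.0):
    """One guided sampling step (illustrative; names and loss are assumptions)."""
    x_t = x_t.detach().requires_grad_(True)
    features = extract_features(x_t, t)            # DDPM activations for x_t
    pred_sketch = lgp(features)                    # differentiable map prediction
    loss = torch.nn.functional.mse_loss(pred_sketch, target_sketch)
    grad = torch.autograd.grad(loss, x_t)[0]       # gradient w.r.t. the noisy image
    x_t = x_t - guidance_scale * grad              # push x_t to agree with the sketch
    return denoise_step(x_t.detach(), t)           # continue standard DDPM sampling
```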