Null-text Inversion for Editing Real Images using Guided Diffusion Models
Recent text-guided diffusion models provide powerful image generation
capabilities. Currently, a massive effort is devoted to enabling the
modification of these images using only text, as a means of offering intuitive
and versatile editing. To edit a real image using these state-of-the-art tools,
one must
first invert the image with a meaningful text prompt into the pretrained
model's domain. In this paper, we introduce an accurate inversion technique and
thus facilitate intuitive text-based modification of the image. Our proposed
inversion consists of two novel key components: (i) Pivotal inversion for
diffusion models. While current methods aim at mapping random noise samples to
a single input image, we use a single pivotal noise vector for each timestep
and optimize around it. We demonstrate that a direct inversion is inadequate on
its own, but does provide a good anchor for our optimization.
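As a concrete illustration of the pivotal inversion, the sketch below performs deterministic DDIM inversion of a clean latent and records the latent trajectory that later serves as the pivot. It is a minimal sketch under assumed interfaces: `unet(z, t, emb)` returning predicted noise, an `alphas_cumprod` schedule, and an ascending `timesteps` array are hypothetical stand-ins, not the paper's released code.

```python
import torch

@torch.no_grad()
def ddim_invert(z0, cond_emb, unet, alphas_cumprod, timesteps):
    """Deterministic DDIM inversion: map a clean latent z0 toward noise.

    Hypothetical interfaces: `unet(z, t, emb)` returns the predicted
    noise eps_theta(z_t, t, C), `alphas_cumprod` is the cumulative
    alpha-bar schedule, and `timesteps` runs from low to high noise.
    Returns the latent trajectory z_0..z_T used as pivots.
    """
    z = z0
    pivots = [z0]
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        eps = unet(z, t, cond_emb)
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        # Predict the clean latent implied by the current noisy one.
        z0_pred = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # Reversed DDIM step: re-noise toward the next (higher) noise level.
        z = a_next.sqrt() * z0_pred + (1 - a_next).sqrt() * eps
        pivots.append(z)
    return pivots
```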
(ii) Null-text optimization, where we only modify the unconditional textual embedding that is
used for classifier-free guidance, rather than the input text embedding. This
allows for keeping both the model weights and the conditional embedding intact
and hence enables applying prompt-based editing while avoiding the cumbersome
tuning of the model's weights.
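To make the null-text optimization concrete, a minimal per-timestep sketch follows. Again, `unet`, `ddim_step`, and the variable names are illustrative assumptions rather than the released implementation: at each denoising step, only the unconditional embedding is optimized so that the classifier-free-guided step lands on the corresponding pivot latent, while the model weights and conditional embedding stay frozen.

```python
import torch
import torch.nn.functional as F

def null_text_optimize(pivots, cond_emb, null_emb_init, unet, ddim_step,
                       timesteps, guidance_scale=7.5, n_iters=10, lr=1e-2):
    """Per-timestep tuning of the unconditional ("null") embedding.

    Hypothetical interfaces: `unet(z, t, emb)` predicts noise and
    `ddim_step(eps, z, t, t_prev)` applies one deterministic DDIM
    denoising step. `pivots` is the inversion trajectory (z_0 first,
    z_T last). Only the null embedding receives gradients.
    """
    null_embs = []
    z = pivots[-1]                      # start denoising from the pivot noise z_T
    init = null_emb_init
    for i in reversed(range(1, len(timesteps))):
        t, t_prev = timesteps[i], timesteps[i - 1]
        with torch.no_grad():           # fixed per step: independent of null_emb
            eps_cond = unet(z, t, cond_emb)
        # Warm-start each step from the previous step's optimized embedding.
        null_emb = init.clone().detach().requires_grad_(True)
        opt = torch.optim.Adam([null_emb], lr=lr)
        for _ in range(n_iters):
            eps_uncond = unet(z, t, null_emb)
            # Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)
            eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
            z_prev = ddim_step(eps, z, t, t_prev)
            loss = F.mse_loss(z_prev, pivots[i - 1])   # track the pivot trajectory
            opt.zero_grad()
            loss.backward()
            opt.step()
        init = null_emb.detach()
        null_embs.append(init)
        with torch.no_grad():           # advance using the tuned embedding
            eps_uncond = unet(z, t, null_emb)
            eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
            z = ddim_step(eps, z, t, t_prev)
    return null_embs
```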
Our Null-text inversion, based on the publicly available Stable Diffusion
model, is extensively evaluated on a variety of images and editing prompts,
showing high-fidelity editing of real images.