Stable generative diffusion models with fine-grained image editing
The Stable Artist: Steering Semantics in Diffusion Latent Space
We present the Stable Artist, an image editing approach enabling fine-grained control of the image generation process from text alone.
The main component is semantic guidance (SEGA), which steers the diffusion process along a variable number of semantic directions.
This allows for subtle edits to images, changes in composition and style, as well as optimization of the overall artistic conception.
Furthermore, SEGA enables probing of the latent space to gain insights into the representation of concepts learned by the model, even complex ones such as 'carbon emission'.
We demonstrate the Stable Artist on several tasks, showcasing high-quality image editing and composition.
Authors
Manuel Brack, Patrick Schramowski, Felix Friedrich, Dominik Hintersdorf, Kristian Kersting
Text-to-image diffusion models can now generate images simply from text input, producing impressive results on generative image tasks.
However, unraveling the concepts they learn during training and understanding how to influence what they actually output remain open questions.
We present the Stable Artist, an iterative approach for guiding a generated image toward the desired output.
Fine-grained control over the generated output and its elements is imperative but hardly feasible through current methods.
The required amount of control is generally only possible by providing image masks in combination with an edit instruction for the masked area.
This is inherently limited, as it discards important structural information and the global composition of the image.
Other approaches like prompt-to-prompt (P2P) rely on a form of soft, implicit masking by interacting with the attention masks of the input prompt.
However, the granularity of control remains limited to the rather coarse-grained dimensions of the attention mask, and these approaches are inherently restricted to one editing operation at a time.
On the other hand, composable diffusion does enable conditioning on multiple concepts but only provides control over the initial image composition and does not support more subtle changes.
Result
We present the Stable Artist, an approach for directly interacting with concepts in the latent space of the model.
We introduce semantic guidance (SEGA), which allows one to steer the diffusion process along several semantic directions simultaneously.
We demonstrate that the resulting approach offers fine-grained control over the generated image, enabling sophisticated image composition and editing.
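To make the steering mechanism concrete, the sketch below shows one way such semantic guidance could be combined with standard classifier-free guidance at a single diffusion step. It is a minimal illustration rather than the authors' exact formulation: the function guided_noise_estimate and all parameter names are hypothetical, the noise estimates are stand-in tensors, and any refinements the full method applies (e.g. warm-up or element-wise thresholding) are omitted.

```python
import torch

def guided_noise_estimate(
    eps_uncond: torch.Tensor,          # unconditional noise estimate eps_theta(z_t)
    eps_prompt: torch.Tensor,          # noise estimate conditioned on the main prompt
    eps_concepts: list,                # noise estimates conditioned on each edit concept
    guidance_scale: float = 7.5,       # classifier-free guidance scale for the prompt
    edit_scales: list = None,          # per-concept guidance strength
    directions: list = None,           # +1 to push toward a concept, -1 to push away
) -> torch.Tensor:
    """Combine per-concept noise estimates into one guided estimate.

    Hypothetical sketch: each edit concept contributes an additional
    guidance term on top of standard classifier-free guidance, so the
    diffusion process is steered along several semantic directions at once.
    """
    if edit_scales is None:
        edit_scales = [1.0] * len(eps_concepts)
    if directions is None:
        directions = [1] * len(eps_concepts)

    # Standard classifier-free guidance toward the main prompt.
    eps = eps_uncond + guidance_scale * (eps_prompt - eps_uncond)

    # One semantic guidance term per edit concept, added or subtracted.
    for eps_c, scale, sign in zip(eps_concepts, edit_scales, directions):
        eps = eps + sign * scale * (eps_c - eps_uncond)

    return eps


# Usage sketch: random tensors stand in for the U-Net's noise predictions
# at a single denoising step in Stable Diffusion's latent space.
shape = (1, 4, 64, 64)
eps_uncond = torch.randn(shape)
eps_prompt = torch.randn(shape)
eps_edits = [torch.randn(shape), torch.randn(shape)]  # e.g. 'smile', 'glasses'

eps_hat = guided_noise_estimate(
    eps_uncond, eps_prompt, eps_edits,
    edit_scales=[5.0, 3.0],
    directions=[+1, -1],  # add 'smile', suppress 'glasses'
)
```

The point of this structure is that each edit concept contributes an independent guidance term relative to the unconditional estimate, so several concepts can be strengthened or suppressed simultaneously, each with its own scale, which is what allows a variable number of semantic directions to be combined in one generation.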