Conditional image synthesis from semantic layouts at any precision level
SceneComposer: Any-Level Semantic Image Synthesis
We propose a new framework for conditional image synthesis from semantic layouts at any precision level, ranging from pure text to a 2D semantic canvas with precise shapes.
More specifically, the input layout consists of one or more semantic regions with free-form text descriptions and adjustable precision levels, which can be set according to the desired degree of control.
We introduce several novel techniques to address the challenges introduced by this new setup, including a pipeline for collecting training data; a precision-encoded mask pyramid and a text feature map representation to jointly encode precision level, semantics, and composition information; and a multi-scale guided diffusion model to synthesize images.
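To make the representation concrete, below is a minimal sketch of how a precision-encoded, multi-scale text feature map could be built. The function names, the scale schedule, and the rule that a region at precision level p contributes only to the p coarsest scales are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def downsample_mask(mask, size):
    """Nearest-neighbor downsampling of a square binary mask (assumed helper)."""
    idx = np.arange(size) * mask.shape[0] // size
    return mask[np.ix_(idx, idx)]

def build_text_feature_pyramid(regions, base_size=64, num_scales=4, dim=512):
    """Sketch: rasterize semantic regions into per-scale text feature maps.

    Every spatial location covered by a region holds that region's text
    embedding. Assumed precision encoding: a region at precision level p
    contributes only to the p coarsest scales, so imprecise shapes do not
    over-constrain the fine resolutions.
    """
    pyramid = []
    for s in range(num_scales):                   # s = 0 is the coarsest scale
        size = base_size >> (num_scales - 1 - s)  # e.g. 8, 16, 32, 64
        fmap = np.zeros((size, size, dim), dtype=np.float32)
        for region in regions:                    # later regions overwrite earlier ones
            if region["precision"] > s:           # precise enough for this scale
                small = downsample_mask(region["mask"], size)
                fmap[small > 0.5] = region["text_embedding"]
        pyramid.append(fmap)
    return pyramid
```

A diffusion model conditioned on such a pyramid could then consume coarse maps at low-resolution feature levels and fine maps at high-resolution ones; that wiring is outside the scope of this sketch.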
To evaluate the proposed method, we collect a test dataset containing user-drawn layouts with diverse scenes and styles.
Experimental results show that the proposed method generates high-quality images that follow the layout at the given precision levels, and that it compares favorably against existing methods.
Authors
Yu Zeng, Zhe Lin, Jianming Zhang, Qing Liu, John Collomosse, Jason Kuen, Vishal M. Patel
We propose a new unified conditional image synthesis framework to generate images from a semantic layout at any combination of precision levels.
It is inspired by the typical coarse-to-fine workflow of artists and designers: they start from an idea, which can be expressed as a text prompt or a set of concepts, then sketch approximate outlines and progressively refine each object.
More specifically, we model a semantic layout as a set of semantic regions with free-form text descriptions.
Each region is assigned a precision level that controls how closely the generated object should fit the specified shape.
The framework reduces to text-to-image generation (T2I) when the layout is at its coarsest, and becomes segmentation-to-image generation (S2I) when the layout is a precise segmentation map.
By adjusting the precision levels, users can achieve their desired degree of control, as illustrated in the sketch below.
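The following sketch illustrates this layout model as a list of regions and shows the two extremes of the precision spectrum; the class and field names are hypothetical illustrations of the description above, not the authors' API.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Region:
    """One semantic region of a layout (hypothetical names): a free-form
    text description, an optional 2D mask on the canvas, and a precision
    level for how strictly the output should follow the mask."""
    text: str
    mask: Optional[np.ndarray] = None   # H x W binary mask; None = no shape given
    precision: int = 0                  # 0 = text only ... 4 = exact segmentation

# Coarsest layout, no shapes at all: reduces to text-to-image generation.
t2i_layout = [Region(text="a corgi surfing a wave at sunset")]

# Finest layout, every pixel assigned to a precise region: segmentation-to-image.
sky = np.zeros((64, 64)); sky[:32] = 1.0
s2i_layout = [
    Region(text="dramatic sunset sky", mask=sky, precision=4),
    Region(text="rolling ocean waves", mask=1.0 - sky, precision=4),
]
```

Intermediate settings mix these extremes: a user can pin down one object with an exact mask while leaving the rest of the scene to a rough blob or to text alone.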
Conclusion
This paper presents a new conditional image synthesis framework to generate images from any-level open-domain semantic layouts.
The input layout ranges from pure text to a 2D semantic canvas with precise shapes.
Several novel techniques are introduced, including a pipeline for collecting training data; representations that jointly encode precision level, semantics, and geometry information; and a multi-scale guided diffusion model to synthesize images.
A test dataset containing user-drawn layouts is collected to evaluate the proposed method.
Experimental results demonstrate the advantage of the unified framework.