Text2Human: Text-Driven Controllable Human Image Generation
Generating high-quality and diverse human images is an important yet
challenging task in vision and graphics. However, existing generative models
often fall short of capturing the high diversity of clothing shapes and textures.
Furthermore, it is desirable for the generation process to be intuitively
controllable even for non-expert users. In this work, we present a text-driven
controllable framework, Text2Human, for high-quality and diverse human
generation. We synthesize full-body human images starting from a given human
pose in two dedicated steps. 1) Given texts describing the shapes of
clothes, the input human pose is first translated into a human parsing map. 2)
The final human image is then generated by further providing the system with
attributes describing the textures of clothes. Specifically, to model the diversity
of clothing textures, we build a hierarchical texture-aware codebook that
stores multi-scale neural representations for each type of texture. The
codebook at the coarse level includes the structural representations of
textures, while the codebook at the fine level focuses on the details of
textures. To make use of the learned hierarchical codebook to synthesize
desired images, a diffusion-based transformer sampler with mixture-of-experts
is first employed to sample indices from the coarsest level of the codebook,
which are then used to predict the indices of the codebook at finer levels. The
predicted indices at different levels are translated into human images by a
decoder learned jointly with the hierarchical codebooks. The use of
mixture-of-experts allows the generated image to be conditioned on the
fine-grained text input, and the prediction of finer-level indices refines the
quality of clothing textures. Extensive quantitative and qualitative
evaluations demonstrate that our proposed framework can generate more diverse
and realistic human images compared to state-of-the-art methods.
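To make the two-step synthesis concrete, the following is a minimal, non-authoritative sketch of the data flow only. The module names (`ShapeParsingGenerator`, `TextureImageGenerator`) and their interfaces are hypothetical placeholders rather than the released Text2Human API: stage 1 maps a pose plus shape-describing text to a parsing map, and stage 2 maps the parsing map plus texture attributes to the final image.

```python
import torch
import torch.nn as nn

class ShapeParsingGenerator(nn.Module):
    """Stage 1 (hypothetical): pose + clothing-shape text -> human parsing map."""
    def __init__(self, num_parsing_classes=24, text_dim=128):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, 16)
        self.net = nn.Conv2d(1 + 16, num_parsing_classes, kernel_size=3, padding=1)

    def forward(self, pose_map, shape_text_emb):
        b, _, h, w = pose_map.shape
        # Broadcast the text embedding spatially and fuse it with the pose map.
        cond = self.text_proj(shape_text_emb).view(b, -1, 1, 1).expand(-1, -1, h, w)
        logits = self.net(torch.cat([pose_map, cond], dim=1))
        return logits.argmax(dim=1)  # per-pixel parsing labels

class TextureImageGenerator(nn.Module):
    """Stage 2 (hypothetical): parsing map + clothing-texture attributes -> RGB image."""
    def __init__(self, num_parsing_classes=24, num_texture_attrs=8):
        super().__init__()
        self.num_classes = num_parsing_classes
        self.net = nn.Conv2d(num_parsing_classes + num_texture_attrs, 3,
                             kernel_size=3, padding=1)

    def forward(self, parsing_map, texture_attrs):
        b, h, w = parsing_map.shape
        parsing_onehot = nn.functional.one_hot(parsing_map, self.num_classes)
        parsing_onehot = parsing_onehot.permute(0, 3, 1, 2).float()
        attrs = texture_attrs.view(b, -1, 1, 1).expand(-1, -1, h, w)
        return torch.tanh(self.net(torch.cat([parsing_onehot, attrs], dim=1)))

# Toy end-to-end run of the two stages with random tensors.
pose = torch.randn(1, 1, 64, 32)      # pose/keypoint map
shape_text = torch.randn(1, 128)      # embedding of e.g. "a long-sleeve dress"
texture_attrs = torch.randn(1, 8)     # embedding of texture attributes, e.g. "floral cotton"

stage1 = ShapeParsingGenerator()
stage2 = TextureImageGenerator()
parsing = stage1(pose, shape_text)          # (1, 64, 32) parsing labels
image = stage2(parsing, texture_attrs)      # (1, 3, 64, 32) synthesized image
```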
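The hierarchical, coarse-to-fine texture generation inside the second step can be sketched in the same spirit. The sketch below assumes a two-level VQ-style codebook; the diffusion-based transformer sampler with mixture-of-experts and the finer-level index predictor are abstracted into placeholder functions (`sample_coarse_indices`, `predict_fine_indices`) that return random indices, so nothing here should be read as the actual Text2Human implementation.

```python
import torch
import torch.nn as nn

class TwoLevelCodebook(nn.Module):
    """Hypothetical two-level texture-aware codebook: a coarse codebook for
    texture structure and a fine codebook for texture details."""
    def __init__(self, n_coarse=512, n_fine=1024, dim=256):
        super().__init__()
        self.coarse = nn.Embedding(n_coarse, dim)   # structural representations
        self.fine = nn.Embedding(n_fine, dim)       # detailed representations

def sample_coarse_indices(text_emb, h=8, w=4, n_coarse=512):
    """Placeholder for the diffusion-based transformer sampler with
    mixture-of-experts, which would sample a grid of coarse-level codebook
    indices conditioned on the fine-grained text."""
    return torch.randint(0, n_coarse, (text_emb.shape[0], h, w))

def predict_fine_indices(coarse_indices, n_fine=1024, scale=2):
    """Placeholder for the network that predicts finer-level indices from the
    coarse ones, refining the quality of clothing textures."""
    b, h, w = coarse_indices.shape
    return torch.randint(0, n_fine, (b, h * scale, w * scale))

class Decoder(nn.Module):
    """Hypothetical decoder learned jointly with the hierarchical codebooks:
    maps multi-level quantized features back to an RGB image."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Conv2d(2 * dim, 3, kernel_size=3, padding=1)

    def forward(self, coarse_feat, fine_feat):
        coarse_up = nn.functional.interpolate(coarse_feat, size=fine_feat.shape[-2:])
        return torch.tanh(self.net(torch.cat([coarse_up, fine_feat], dim=1)))

codebook = TwoLevelCodebook()
decoder = Decoder()

text_emb = torch.randn(1, 128)                     # fine-grained texture text
idx_coarse = sample_coarse_indices(text_emb)       # (1, 8, 4) coarse indices
idx_fine = predict_fine_indices(idx_coarse)        # (1, 16, 8) fine indices

feat_coarse = codebook.coarse(idx_coarse).permute(0, 3, 1, 2)  # (1, 256, 8, 4)
feat_fine = codebook.fine(idx_fine).permute(0, 3, 1, 2)        # (1, 256, 16, 8)
image = decoder(feat_coarse, feat_fine)                        # (1, 3, 16, 8)
```

The key design choice mirrored here is that decoding consumes features looked up from both codebook levels, so the coarse level fixes texture structure while the fine level supplies detail.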