CogView: Mastering Text-to-Image Generation via Transformers
Text-to-image generation in the general domain has long been an open problem,
which requires both a powerful generative model and cross-modal understanding.
We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer,
to advance this problem. We also demonstrate finetuning strategies for various
downstream tasks, e.g. style learning, super-resolution, text-image ranking and
fashion design, and methods to stabilize pretraining, e.g. eliminating NaN
losses. CogView achieves a new state-of-the-art FID on the blurred MS COCO
dataset in the zero-shot setting, outperforming previous GAN-based models and
a recent similar work, DALL-E.
Authors
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang
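
The abstract's core recipe, a single autoregressive Transformer over the concatenation of text tokens and discrete image tokens produced by a VQ-VAE encoder, can be illustrated as follows. This is a minimal PyTorch sketch, not the released CogView code; the vocabulary sizes, sequence length, and model dimensions are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder sizes for illustration only; the real CogView is a
# 4-billion-parameter model with a learned VQ-VAE image codebook.
TEXT_VOCAB = 50_000   # assumed text-token vocabulary size
IMAGE_VOCAB = 8_192   # assumed VQ-VAE codebook size
MAX_LEN = 1_088       # e.g. 64 text tokens + 32*32 image tokens

class CogViewSketch(nn.Module):
    """Autoregressive Transformer over [text tokens; image tokens]."""

    def __init__(self, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Joint vocabulary: ids in [0, TEXT_VOCAB) are text tokens,
        # ids in [TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB) are VQ-VAE image codes.
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, d_model)
        self.pos = nn.Embedding(MAX_LEN, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, 4 * d_model, batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens):  # tokens: (batch, seq) joint ids
        T = tokens.size(1)
        h = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        # Causal mask: each position attends only to earlier positions, so
        # image tokens are generated conditioned on the text prefix.
        mask = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        return self.head(self.blocks(h, mask=mask))  # next-token logits

# One training step: next-token cross-entropy over the joint sequence.
model = CogViewSketch()
tokens = torch.randint(0, TEXT_VOCAB + IMAGE_VOCAB, (2, 128))
logits = model(tokens[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                       tokens[:, 1:].reshape(-1))
loss.backward()
```

On "eliminating NaN losses": the paper's precision-bottleneck relaxation (PB-relax) targets fp16 overflow in the attention logits, one source of NaN losses at this scale. Below is a hedged sketch of that rescaling; the constant `alpha` and its placement are assumptions, not the paper's verbatim recipe.

```python
def pb_relax_attention(q, k, v, alpha=32.0):
    """PB-relax-style attention (sketch; `alpha` is an assumed constant).

    softmax(QK^T / sqrt(d)) is computed as
    softmax((QK^T / (alpha * sqrt(d)) - rowmax) * alpha):
    dividing by alpha keeps the fp16 matmul small, and subtracting the
    row-wise max is a per-row constant shift, so the softmax is unchanged.
    """
    d = q.size(-1)
    scores = (q / (alpha * d ** 0.5)) @ k.transpose(-2, -1)
    scores = (scores - scores.amax(dim=-1, keepdim=True).detach()) * alpha
    return scores.softmax(dim=-1) @ v
```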