Character-Aware Models Improve Visual Text Rendering
Current image generation models struggle to reliably produce well-formed
visual text. In this paper, we investigate a key contributing factor: popular
text-to-image models lack character-level input features, making it much harder
to predict a word's visual makeup as a series of glyphs. To quantify the extent
of this effect, we conduct a series of controlled experiments comparing
character-aware vs. character-blind text encoders. In the text-only domain, we
find that character-aware models provide large gains on a novel spelling task
(WikiSpell). Transferring these findings to the visual domain, we train a
suite of image generation models, and show that character-aware variants
outperform their character-blind counterparts across a range of novel text
rendering tasks (our DrawText benchmark). Our models set a new state of the
art on visual spelling, with 30+ point accuracy gains over competitors on
rare words, despite training on far fewer examples.
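
To make the distinction concrete, the sketch below contrasts how a character-blind subword tokenizer and a character-aware byte-level tokenizer see the same word. This is an illustrative assumption rather than the paper's exact setup: T5 and ByT5 (via the Hugging Face transformers library) merely stand in for character-blind and character-aware encoders, respectively.

```python
# Illustrative sketch (assumed setup, not the paper's exact configuration):
# contrast a character-blind subword tokenizer with a character-aware
# byte-level tokenizer, using T5 and ByT5 from Hugging Face `transformers`
# as stand-ins. Requires `transformers` and `sentencepiece`.
from transformers import AutoTokenizer

word = "glyphs"

# Character-blind: SentencePiece subword pieces obscure the word's spelling.
subword_tok = AutoTokenizer.from_pretrained("t5-small")
print(subword_tok.tokenize(word))  # subword pieces, e.g. ['▁gly', 'ph', 's']

# Character-aware: one token per UTF-8 byte, so every character is visible.
byte_tok = AutoTokenizer.from_pretrained("google/byt5-small")
print(byte_tok.tokenize(word))  # per-byte tokens: ['g', 'l', 'y', 'p', 'h', 's']
```

A model given only the subword pieces must memorize each piece's spelling, whereas the byte-level view exposes the glyph sequence directly; this is the gap the controlled experiments above quantify.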
Authors
Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, RJ Mical, Mohammad Norouzi, Noah Constant