Text-to-Image Generation Grounded by Fine-Grained User Attention
Localized Narratives is a dataset with detailed natural language descriptions
of images paired with mouse traces that provide a sparse, fine-grained visual
grounding for phrases. We propose TReCS, a sequential model that exploits this
grounding to generate images. TReCS uses descriptions to retrieve segmentation
masks and predict object labels aligned with mouse traces. These alignments are
used to select and position masks to generate a fully covered segmentation
canvas; the final image is produced by a segmentation-to-image generator using
this canvas. This multi-step, retrieval-based approach outperforms existing
direct text-to-image generation models on both automatic metrics and human
evaluations: overall, its generated images are more photo-realistic and better
match descriptions.
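
To make the multi-step pipeline in the abstract concrete, below is a minimal, hypothetical sketch in Python of a TReCS-style flow: predict an object label per phrase, retrieve a mask for that label, position masks at the centroids of the aligned mouse traces to build a fully covered segmentation canvas, and hand the canvas to a segmentation-to-image generator. All helpers here (predict_label, retrieve_mask, seg_to_image) are illustrative stand-ins, not the paper's actual models or code.

```python
"""Illustrative sketch of a TReCS-style pipeline (not the paper's implementation)."""
from dataclasses import dataclass

import numpy as np

CANVAS_SIZE = 64  # small canvas, for illustration only


@dataclass
class Phrase:
    text: str           # a phrase from the narrative description
    trace: np.ndarray   # (N, 2) mouse-trace points, (row, col) in [0, 1]


def predict_label(phrase: Phrase) -> str:
    """Stand-in for the tagger that maps description phrases to object labels."""
    vocab = {"dog": "dog", "grass": "grass", "sky": "sky"}
    return next((v for k, v in vocab.items() if k in phrase.text), "background")


def retrieve_mask(label: str) -> np.ndarray:
    """Stand-in for mask retrieval: return a binary mask for the label."""
    size = 16 if label == "dog" else 48   # pretend background regions are larger
    return np.ones((size, size), dtype=bool)


def compose_canvas(phrases: list[Phrase]) -> np.ndarray:
    """Place retrieved masks at trace centroids so the canvas is fully covered."""
    canvas = np.zeros((CANVAS_SIZE, CANVAS_SIZE), dtype=np.int32)
    labels = [predict_label(p) for p in phrases]
    # paint larger (background-like) masks first so foreground objects stay on top
    order = sorted(range(len(phrases)),
                   key=lambda i: -retrieve_mask(labels[i]).size)
    for i in order:
        mask = retrieve_mask(labels[i])
        cy, cx = (phrases[i].trace.mean(axis=0) * CANVAS_SIZE).astype(int)
        h, w = mask.shape
        y0, x0 = max(cy - h // 2, 0), max(cx - w // 2, 0)
        y1, x1 = min(y0 + h, CANVAS_SIZE), min(x0 + w, CANVAS_SIZE)
        canvas[y0:y1, x0:x1][mask[: y1 - y0, : x1 - x0]] = i + 1
    return canvas


def seg_to_image(canvas: np.ndarray) -> np.ndarray:
    """Stand-in for the segmentation-to-image generator (a GAN in practice)."""
    rng = np.random.default_rng(0)
    palette = rng.integers(0, 256, size=(canvas.max() + 1, 3))
    return palette[canvas]  # fake "image": one colour per canvas region


if __name__ == "__main__":
    phrases = [
        Phrase("there is a dog", np.array([[0.5, 0.5]])),
        Phrase("standing on the grass", np.array([[0.6, 0.8]])),
    ]
    image = seg_to_image(compose_canvas(phrases))
    print(image.shape)  # (64, 64, 3)
```

The sketch keeps the sequential, retrieval-based structure the abstract describes (label prediction, mask retrieval, trace-guided placement, then image synthesis) while the individual components are trivial placeholders.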
Authors
Jing Yu Koh, Jason Baldridge, Honglak Lee, Yinfei Yang