A major goal of multimodal research is to improve machine understanding of
images and text. Tasks include image captioning, text-to-image generation, and
vision-language representation learning. So far, research has focused on the
relationships between images and text. For example, captioning models attempt
to capture the semantics of an image and then express them in text. An
important question is: which annotation best reflects a deep understanding of
the image content? Similarly, given a text, which image best presents the
semantics of that text? In this work, we argue that the best caption for a
given image is the text that would generate the image most similar to that
image. Likewise, the best image for a given text is the image whose generated
caption aligns best with the original text. To
this end, we propose a unified framework that includes both a text-to-image
generative model and an image-to-text generative model. Extensive experiments
validate our approach.
Authors
Hang Li, Jindong Gu, Rajat Koner, Sahand Sharifzadeh, Volker Tresp
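
To make the selection criterion concrete, the following minimal Python sketch (not the authors' implementation) scores each candidate caption by regenerating an image from it and comparing the reconstruction with the original image; `text_to_image` and `image_similarity` are hypothetical placeholders standing in for, e.g., a text-to-image diffusion model and a CLIP-based image similarity.

```python
from typing import Callable, List, Tuple, Any


def select_best_caption(
    image: Any,
    candidate_captions: List[str],
    text_to_image: Callable[[str], Any],        # hypothetical text-to-image generator
    image_similarity: Callable[[Any, Any], float],  # hypothetical image-image similarity score
) -> Tuple[str, float]:
    """Pick the caption whose regenerated image is most similar to the original image."""
    best_caption, best_score = "", float("-inf")
    for caption in candidate_captions:
        reconstructed = text_to_image(caption)          # regenerate an image from the candidate caption
        score = image_similarity(image, reconstructed)  # compare reconstruction with the original
        if score > best_score:
            best_caption, best_score = caption, score
    return best_caption, best_score
```

The symmetric criterion for choosing the best image for a given text would swap the roles of the two models: caption each candidate image with an image-to-text model and keep the image whose caption scores highest against the original text.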