We propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant knowledge fetched by a retriever from external memory (e.g., multimodal documents on the web).
We implement the retriever with a pretrained CLIP model and the generator with the CM3 Transformer architecture, and train the model on the LAION dataset.
Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate mixtures of text and images.
We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (30% of the compute of DALL-E) and exhibiting novel capabilities such as knowledge-intensive image generation and multimodal in-context learning.
Authors
Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, Wen-tau Yih
In this work, we present the first retrieval-augmented multimodal model that can retrieve and generate text and images.
Given input text, such a model uses a retriever to fetch relevant documents from an external memory, and lets a generator use the retrieved documents to make better predictions.
Our input data and external memory comprise a set of multimodal documents, each of which is an image, text, or a mixture (concatenation) of the two.
We design a retriever and a generator that handle multimodal documents, consisting of images and text.
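As a concrete illustration, the sketch below shows one way such a mixed-modal retriever could work: the text and image parts of a document are encoded separately with frozen CLIP encoders, averaged into a single vector, and candidates are scored by inner product. The specific CLIP checkpoint, the averaging, and the helper names are assumptions for illustration, not the exact configuration of the paper.

```python
# Minimal sketch of a mixed-modal retriever: encode the text and image parts of a
# document separately with frozen CLIP encoders, average them into one vector, and
# score candidates by inner product. Checkpoint and pooling are illustrative choices.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def encode_document(text=None, image=None):
    """Return one L2-normalized embedding for a (text, image) multimodal document."""
    parts = []
    if text is not None:
        tokens = processor(text=[text], return_tensors="pt", truncation=True)
        parts.append(model.get_text_features(**tokens))
    if image is not None:
        pixels = processor(images=[image], return_tensors="pt")
        parts.append(model.get_image_features(**pixels))
    emb = torch.stack(parts).mean(dim=0)          # average text and image embeddings
    return emb / emb.norm(dim=-1, keepdim=True)   # normalize for inner-product search

@torch.no_grad()
def retrieve(query_emb, memory_embs, k=2):
    """Maximum inner product search over precomputed memory embeddings."""
    scores = memory_embs @ query_emb.squeeze(0)
    return scores.topk(k).indices.tolist()
```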
Given this retriever, we design a technique to retrieve diverse and informative documents from the external memory for the input document.
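One simple way to realize such a strategy is sketched below: candidates are ranked by similarity to the query, and a candidate is kept only if it is not nearly identical to documents already selected. The greedy procedure and the similarity threshold are illustrative assumptions rather than the paper's exact algorithm.

```python
# Hedged sketch of retrieving relevant yet mutually diverse documents: rank memory by
# query similarity, then greedily keep a candidate only if it is not too similar to
# anything already kept. The 0.9 threshold is an illustrative assumption.
import torch

def retrieve_diverse(query_emb, memory_embs, k=2, dedup_threshold=0.9):
    query = query_emb.squeeze(0)
    order = (memory_embs @ query).argsort(descending=True)   # most relevant first
    kept = []
    for idx in order.tolist():
        candidate = memory_embs[idx]
        too_similar = any(
            float(candidate @ memory_embs[j]) > dedup_threshold for j in kept
        )
        if not too_similar:
            kept.append(idx)
        if len(kept) == k:
            break
    return kept
```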
For the generator, we prepend the retrieved documents as in-context examples to the main input document, and train it by jointly optimizing the token prediction loss over both the main document and the retrieved documents.
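The sketch below illustrates this training setup, assuming a Hugging Face style causal language model interface: retrieved documents are prepended to the main document, a token prediction loss is computed over the full sequence, and the retrieved-document tokens are down-weighted by a factor alpha. The weighting scheme and function names are assumptions for illustration, not the paper's exact objective.

```python
# Minimal sketch of the generator's training step: retrieved documents are prepended
# to the main document as in-context examples, and a causal LM loss is computed over
# the whole sequence, with retrieved-document tokens down-weighted (alpha is an
# illustrative choice).
import torch
import torch.nn.functional as F

def joint_loss(model, retrieved_ids, main_ids, alpha=0.5):
    """retrieved_ids: [T_r] tokens of retrieved docs; main_ids: [T_m] tokens of main doc."""
    input_ids = torch.cat([retrieved_ids, main_ids]).unsqueeze(0)   # [1, T_r + T_m]
    logits = model(input_ids).logits                                # [1, T, V]
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",
    )                                                               # [T - 1]
    weights = torch.ones_like(per_token)
    weights[: retrieved_ids.numel() - 1] = alpha                    # down-weight retrieved tokens
    return (weights * per_token).mean()
```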
Results
Knowledge-intensive generation is one of the most important tasks in machine learning.
However, knowledge-intensive multimodal generation has so far been limited to text- or image-to-text generation.
Our model generates faithful images for entity-rich captions that contain rare or unseen compositions of entities (e.g., “French flag on the moon”, “Mount Rushmore”, “Japanese cherry”); we then apply an off-the-shelf super-resolution tool @cite2 to upsample the generated images.
Thanks to this retrieval capability, our model exhibits novel qualitative capabilities such as knowledge-intensive multimodal generation and multimodal in-context learning.