Text-based Contrastive Vision and Language Training
I Can't Believe There's No Images! Learning Visual Tasks Using only Language Data
This paper explores the joint embedding space of contrastively trained vision and language encoders and shows that it is possible to learn high-level skills from text data and then use them to complete vision tasks without ever training on visual data.
We produce models using only text training data on three tasks: image captioning, visual entailment, and visual question answering, and we evaluate them on standard benchmarks using images.
We find that this kind of transfer is possible and results in only a small drop in performance relative to models trained on images.
We also showcase a variety of stylistic image captioning models that were trained using no image data and no human-curated language data, but instead text data from books, the web, or language models.
We study the possibility of cross-modal transfer between vision and natural language processing (NLP) tasks by training models to complete a task using only natural language data and then testing them on the same task with visual inputs instead of text.
We call this setting cross-modal transfer because it requires applying skills learned in one modality to a different one.
Accomplishing this requires encoding images and text into a shared semantic space.
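A contrastive vision-and-language encoder such as CLIP provides this kind of shared space: text and images are mapped to vectors that can be compared, or substituted for one another, directly. The sketch below is a minimal illustration using the Hugging Face transformers CLIP API; the checkpoint name, example texts, and image path are illustrative assumptions rather than the paper's configuration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a contrastively trained vision-and-language encoder (illustrative checkpoint).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Encode text into the joint embedding space.
texts = ["a dog catching a frisbee", "a plate of pasta"]
text_inputs = processor(text=texts, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)      # shape (2, 512)

# Encode an image into the same space (path is a placeholder).
image = Image.open("dog.jpg")
image_inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_emb = model.get_image_features(**image_inputs)   # shape (1, 512)

# Because both modalities live in one space, they can be compared directly.
similarity = torch.nn.functional.cosine_similarity(image_emb, text_emb)
print(similarity)  # higher score for the caption that matches the image
```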
We propose a method, cross modal transfer on semantic embeddings (CLOSE), to take advantage of these encoders.
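The key property this method relies on is that the downstream task model only ever consumes a vector from the joint space, so it can be fit on text embeddings and handed image embeddings at test time. The following is a hedged sketch of that substitution under assumed details: the small classification head, the 512-dimensional embeddings, and the three-way label set are placeholders, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

EMB_DIM = 512  # CLIP projection dimension (assumed)

# Task head that only ever sees a vector from the joint embedding space.
task_head = nn.Sequential(
    nn.Linear(EMB_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, 3),  # e.g. entailment / neutral / contradiction
)

def l2_normalize(x: torch.Tensor) -> torch.Tensor:
    return x / x.norm(dim=-1, keepdim=True)

def train_step(text_emb, labels, optimizer, loss_fn=nn.CrossEntropyLoss()):
    """Training: the text side of an example (e.g. a caption standing in
    for the image) is encoded, and the head is fit on those embeddings."""
    logits = task_head(l2_normalize(text_emb))
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def predict(image_emb):
    """Inference: swap in the image embedding; the task head is unchanged."""
    return task_head(l2_normalize(image_emb)).argmax(dim=-1)
```

The only change between training and testing is which encoder produces the input vector, which is what makes text-only training of a visual task feasible.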
Results
We showed that contrastive models can be used for cross-modal generalization through their shared embedding space, demonstrated an application to stylistic captioning, studied the approach's sensitivity to differences between text and image embeddings, and trained adapters to reduce those differences.
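As a rough illustration of the last point, an adapter can be a small module trained on a limited amount of paired data to map text embeddings toward their corresponding image embeddings before the task model uses them. The linear form and cosine-distance objective below are assumptions for the sketch, not necessarily the exact adapter studied in the paper.

```python
import torch
import torch.nn as nn

class LinearAdapter(nn.Module):
    """Maps text embeddings toward the image-embedding region of the joint space."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        out = self.proj(text_emb)
        return out / out.norm(dim=-1, keepdim=True)

def adapter_loss(adapter: LinearAdapter, text_emb, paired_image_emb):
    """Cosine distance between adapted text embeddings and their paired image embeddings."""
    adapted = adapter(text_emb)
    target = paired_image_emb / paired_image_emb.norm(dim=-1, keepdim=True)
    return (1.0 - (adapted * target).sum(dim=-1)).mean()
```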