Visual Representations Learned in Text-Only Language Models
Linearly Mapping from Image to Text Space
Text-only language models are trained exclusively on linguistic data, and the extent to which they learn to represent the physical, non-linguistic world is an open question. We test a stronger hypothesis: that the conceptual representations learned by text-only models are functionally equivalent (up to a linear transformation) to those learned by models trained on vision tasks. Specifically, we show that image representations from vision models can be transferred as continuous prompts to text-only models by training only a single linear projection. Using these prompts, the text-only language model achieves competitive performance on captioning and visual question answering tasks compared to models that tune both the image encoder and text decoder (such as the magma model). We compare three image encoders with increasing amounts of linguistic supervision seen during pretraining: beit (no language information), nf-resnet (lexical category information), and clip (full natural language descriptions). We find that all three encoders perform equally well at transferring visual property information to the language model (e.g., whether an animal is large or small), but that image encoders pretrained with linguistic supervision more saliently encode category information (e.g., distinguishing hippo vs. elephant) and thus perform significantly better on benchmark language-and-vision tasks. Our results indicate that text-only models encode conceptual information in a structure similar to that of vision-based models, even vision models trained solely on images.
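To make the setup concrete, the sketch below illustrates the kind of linear projection described above: a single trainable linear map that turns a frozen image encoder's output into a short sequence of continuous prompt vectors, which are prepended to the language model's token embeddings. The dimensions, prompt length, and the `LinearImagePrompt` class name are illustrative assumptions, not the paper's exact configuration; random tensors stand in for the frozen encoder and language model.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions -- real values depend on the chosen image encoder
# (e.g., beit / nf-resnet / clip) and the text-only language model.
IMG_DIM = 1024      # dimensionality of the frozen image encoder's output
LM_DIM = 768        # dimensionality of the language model's token embeddings
PROMPT_LEN = 4      # number of continuous prompt vectors fed to the LM


class LinearImagePrompt(nn.Module):
    """Maps a frozen image representation to a sequence of soft prompts.

    Only this projection is trained; the image encoder and the text-only
    language model remain frozen.
    """

    def __init__(self, img_dim: int, lm_dim: int, prompt_len: int):
        super().__init__()
        # A single linear map from image space to (prompt_len * lm_dim).
        self.proj = nn.Linear(img_dim, prompt_len * lm_dim)
        self.prompt_len = prompt_len
        self.lm_dim = lm_dim

    def forward(self, img_features: torch.Tensor) -> torch.Tensor:
        # img_features: (batch, img_dim) pooled output of the image encoder
        prompts = self.proj(img_features)
        return prompts.view(-1, self.prompt_len, self.lm_dim)


# Usage sketch with placeholder tensors standing in for real model outputs.
projector = LinearImagePrompt(IMG_DIM, LM_DIM, PROMPT_LEN)

img_features = torch.randn(2, IMG_DIM)     # frozen image encoder output
token_embeds = torch.randn(2, 10, LM_DIM)  # frozen LM embeddings of the caption

# Prepend the projected image prompts to the caption's token embeddings; the
# frozen language model would then consume this combined embedding sequence.
soft_prompts = projector(img_features)
lm_inputs = torch.cat([soft_prompts, token_embeds], dim=1)  # (2, 14, LM_DIM)

# Only the projection's parameters would receive gradient updates.
trainable_params = [p for p in projector.parameters() if p.requires_grad]
```

In this scheme the projection is the only component whose weights change, so any success on captioning or visual question answering must come from aligning the encoder's existing visual representations with the language model's existing conceptual space.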