TraVLR: Now You See It, Now You Don't! Evaluating Cross-Modal Transfer of Visio-Linguistic Reasoning
Numerous visio-linguistic (V+L) representation learning methods have been
developed, yet existing datasets do not evaluate the extent to which they
represent visual and linguistic concepts in a unified space. Inspired by the
cross-lingual transfer and psycholinguistics literature, we propose a novel
evaluation setting for V+L models: zero-shot cross-modal transfer. Existing V+L
benchmarks also often report global accuracy scores over the entire dataset,
making it difficult to pinpoint the specific reasoning tasks at which models
succeed or fail. To address this issue and enable the evaluation of
cross-modal transfer, we present TraVLR, a synthetic dataset comprising four
V+L reasoning tasks. Each example encodes the scene bimodally such that either
modality can be dropped during training/testing with no loss of relevant
information. TraVLR's training and testing distributions are also constrained
along task-relevant dimensions, enabling the evaluation of out-of-distribution
generalisation. We evaluate four state-of-the-art V+L models and find that,
although they perform well on test examples from the same modality, all models
fail to transfer cross-modally and have limited success in accommodating the
addition or deletion of a modality. Consistent with prior work, we also find
that these models require large amounts of data to learn simple spatial
relationships. We release TraVLR as an open challenge for the research
community.