VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers
We propose novel interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers.
We demonstrate the functionalities of our tool through the analysis of an end-to-end pretrained vision-language multimodal transformer-based model on Visual Commonsense Reasoning (VCR) and WebQA, two visual question answering benchmarks.
We also present a few interesting findings about multimodal transformer behaviors that we uncovered through our tool.
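As a rough illustration of the kind of data such a tool inspects, the sketch below extracts per-layer attentions and hidden states from a vision-language transformer and renders one attention head as a heatmap. The ViLT checkpoint from Hugging Face transformers is an assumed stand-in for illustration only, not the model analyzed here, and VL-InterpreT's actual implementation may differ.

```python
# Minimal sketch: pull attentions and hidden states from a vision-language
# transformer for inspection. The ViLT checkpoint is an assumed stand-in,
# not the model analyzed in the paper.
import torch
import matplotlib.pyplot as plt
from PIL import Image
from transformers import ViltProcessor, ViltModel

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
model = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")

image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
text = "a person riding a bicycle down the street"

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True, output_hidden_states=True)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer;
# outputs.hidden_states: one (batch, seq, hidden) tensor per layer.
layer, head = 5, 3
attn = outputs.attentions[layer][0, head]  # (seq, seq) attention map

plt.imshow(attn.numpy(), cmap="viridis")
plt.xlabel("key token index")
plt.ylabel("query token index")
plt.title(f"Layer {layer}, head {head} attention")
plt.show()
```

An interactive tool such as VL-InterpreT builds on this kind of per-layer, per-head data, letting users browse heads and layers and relate image-region and text tokens rather than producing one static plot.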
Authors
Estelle Aflalo, Meng Du, Shao-Yen Tseng, Yongfei Liu, Chenfei Wu, Nan Duan, Vasudev Lal