An Empirical Study of the Most Important Factors in Video-and-Language Model Design
VindLU: A Recipe for Effective Video-and-Language Pretraining
This paper conducts a thorough empirical study demystifying the most important factors in video-and-language (VidL) model design.
Among the factors that we investigate are (i) the spatiotemporal architecture design, (ii) the multimodal fusion schemes, (iii) the pretraining objectives, (iv) the choice of pretraining data, (v) the pretraining and finetuning protocols, and (vi) dataset and model scaling.
Using these empirical insights, we then develop a step-by-step recipe, dubbed VindLU, for effective VidL pretraining.
Our final model, trained using our recipe, achieves results comparable to or better than the state of the art on several VidL tasks without relying on external CLIP pretraining.
Authors
Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius
The last few years have witnessed remarkable progress in video-and-language (VidL) understanding.
However, the model architectures and pretraining/finetuning protocols used by modern VidL approaches have become significantly more complex and specialized over the last several years.
As a result, it is increasingly difficult to reproduce, analyze, and compare the most recent VidL frameworks.
We dissect recent VidL frameworks along multiple dimensions, including their temporal modeling schemes, multimodal fusion modules, pretraining objectives, pretraining datasets, and the number of frames used for pretraining, finetuning, and inference.
Based on this analysis, we observe that there exist significant differences among these VidL methods, making it challenging to reproduce, analyze, and compare them.
Unfortunately, it is not clear which of these differences are important for overall VidL performance and which are not.
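To make one of the dimensions above concrete, the sketch below shows a minimal cross-attention multimodal fusion block in PyTorch. It is an illustrative assumption, not the fusion module of VindLU or of any specific method compared in the study; the class name CrossModalFusionBlock, the hidden size, and the head count are all hypothetical.

```python
# Hypothetical sketch of one multimodal fusion scheme (cross-attention),
# provided only for illustration; not the paper's exact implementation.
import torch
import torch.nn as nn


class CrossModalFusionBlock(nn.Module):
    """Fuses text tokens with video tokens via cross-attention."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, text: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # Self-attention over the text tokens.
        t = self.norm1(text)
        text = text + self.self_attn(t, t, t)[0]
        # Cross-attention: text queries attend to spatiotemporal video tokens.
        t = self.norm2(text)
        text = text + self.cross_attn(t, video, video)[0]
        # Position-wise feed-forward network.
        return text + self.ffn(self.norm3(text))


# Usage: 32 text tokens attending to 196 video patch tokens (batch of 2).
fused = CrossModalFusionBlock()(torch.randn(2, 32, 768), torch.randn(2, 196, 768))
```

Cross-attention of this kind is only one of the fusion schemes seen in recent VidL models; others fuse modalities by concatenating token sequences before joint self-attention, or avoid fusion layers entirely and rely on a dot-product similarity between separately encoded video and text, as in CLIP-style dual encoders.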