Temporal Modeling via Video Transformer and Masked Visual-token Modeling
VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling
We present a fully end-to-end video-language transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
Furthermore, unlike previous studies that found pre-training tasks on video inputs (e.g., masked frame modeling) not very effective, we design a new pre-training task, masked visual-token modeling (MVM), for better video modeling.
Specifically, the original video frame patches are "tokenized" into discrete visual tokens, and the goal is to recover the original visual tokens based on the masked patches.
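To make the MVM objective concrete, the following is a minimal sketch of how such a loss could be computed; the callables `visual_tokenizer` and `video_transformer` and the masking scheme are hypothetical placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def mvm_loss(video_patches, visual_tokenizer, video_transformer, mask_ratio=0.15):
    """Sketch of masked visual-token modeling (MVM).

    video_patches:     (B, N, D) flattened video frame patches
    visual_tokenizer:  frozen discrete tokenizer mapping patches to token ids (hypothetical)
    video_transformer: backbone predicting a token distribution per patch (hypothetical)
    """
    B, N, _ = video_patches.shape

    # "Tokenize" each patch into a discrete visual token id; no gradients flow
    # into the tokenizer, so the targets stay fixed during pre-training.
    with torch.no_grad():
        target_ids = visual_tokenizer(video_patches)          # (B, N) integer ids

    # Randomly mask a subset of patches; only masked positions contribute to the loss.
    mask = torch.rand(B, N, device=video_patches.device) < mask_ratio  # (B, N) bool
    masked_patches = video_patches.masked_fill(mask.unsqueeze(-1), 0.0)

    # Predict a distribution over the visual-token vocabulary for every patch.
    logits = video_transformer(masked_patches)                # (B, N, vocab_size)

    # Recover the original visual tokens at the masked positions.
    return F.cross_entropy(logits[mask], target_ids[mask])
```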
Comprehensive analysis demonstrates the effectiveness of both explicit temporal modeling via video transformer and MVM.
As a result, we achieve new state-of-the-art performance on 5 video question answering tasks and 4 text-to-video retrieval tasks.
Authors
Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu