VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
We present VideoCLIP, a contrastive approach to pre-train a unified model for
zero-shot video and text understanding, without using any labels on downstream
tasks. VideoCLIP trains a transformer for video and text by contrasting
temporally overlapping positive video-text pairs with hard negatives from
nearest neighbor retrieval. Our experiments on a diverse set of downstream
tasks, including sequence-level text-video retrieval, VideoQA, token-level
action localization, and action segmentation, reveal state-of-the-art
performance, surpassing prior work, and in some cases even outperforming
supervised approaches. Code is made available at
supervised approaches. Code is made available at
this https URL
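
As a reading aid, the snippet below is a minimal sketch (not the authors' released code) of the kind of contrastive objective the abstract describes: temporally overlapping video-text pairs serve as positives, while negatives combine other pairs in the batch with hard negatives obtained via nearest-neighbor retrieval. The function name `videoclip_style_loss`, the `retrieved_text_negs` input, the tensor shapes, and the temperature value are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def videoclip_style_loss(video_emb, text_emb, retrieved_text_negs, temperature=0.07):
    """Symmetric InfoNCE over L2-normalized embeddings (illustrative sketch).

    video_emb:           (B, D) embeddings of video clips
    text_emb:            (B, D) embeddings of temporally overlapping text
    retrieved_text_negs: (B, K, D) hard-negative text embeddings found by
                         nearest-neighbor retrieval (assumed input)
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    negs = F.normalize(retrieved_text_negs, dim=-1)
    B = v.size(0)

    # Video-to-text direction: the positive is the overlapping caption;
    # negatives are the other captions in the batch plus the retrieved hard negatives.
    pos = (v * t).sum(-1, keepdim=True)                         # (B, 1)
    in_batch = v @ t.T                                          # (B, B)
    diag = torch.eye(B, dtype=torch.bool, device=v.device)
    in_batch = in_batch.masked_fill(diag, float("-inf"))        # drop the positive from negatives
    hard = torch.einsum("bd,bkd->bk", v, negs)                  # (B, K)
    logits_v2t = torch.cat([pos, in_batch, hard], dim=1) / temperature
    target = torch.zeros(B, dtype=torch.long, device=v.device)  # positive sits in column 0
    loss_v2t = F.cross_entropy(logits_v2t, target)

    # Text-to-video direction with in-batch negatives only (a simplification).
    logits_t2v = (t @ v.T) / temperature
    loss_t2v = F.cross_entropy(logits_t2v, torch.arange(B, device=v.device))

    return 0.5 * (loss_v2t + loss_t2v)
```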