PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining
Large-scale vision-language pre-training has achieved promising results on downstream tasks.
Here we introduce PyramidCLIP, which constructs an input pyramid with different semantic levels and aligns visual elements and linguistic elements in a hierarchical manner via intra-level semantics alignment and cross-level relation alignment (see the sketch below).
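As a rough illustration of the hierarchical alignment idea, the sketch below assumes the input pyramid pairs a global image view with a summarized caption and local region features with the full caption; the pairing, function names, and variable names (e.g. `clip_loss`, `global_img`) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def clip_loss(a, b, temperature=0.07):
    """Standard symmetric InfoNCE loss between two batches of features."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def pyramid_alignment_loss(global_img, local_img, summary_txt, full_txt):
    # Intra-level: align visual and linguistic features of the same semantic granularity.
    intra = clip_loss(global_img, summary_txt) + clip_loss(local_img, full_txt)
    # Cross-level: relate coarse visual features to fine linguistic ones, and vice versa.
    cross = clip_loss(global_img, full_txt) + clip_loss(local_img, summary_txt)
    return intra + cross
```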
Furthermore, we soften the loss of negative (unpaired) samples in the objective function, relaxing the strict constraint during pre-training and thus mitigating the risk of the model becoming over-confident.
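One way to realize such a softened objective is a label-smoothed contrastive loss: instead of hard one-hot targets, a small probability mass is assigned to unpaired samples so they are penalized less severely. The sketch below follows this assumption; the smoothing hyperparameter `alpha` and the function name are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(image_feats, text_feats, temperature=0.07, alpha=0.1):
    """CLIP-style symmetric loss with label-smoothed (softened) targets."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature  # (B, B) similarity matrix
    n = logits.size(0)
    # Softened targets: 1 - alpha on the matched pair, alpha spread over negatives.
    targets = torch.full_like(logits, alpha / (n - 1))
    targets.fill_diagonal_(1.0 - alpha)
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return (loss_i2t + loss_t2i) / 2
```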
Experiments on three downstream tasks, including zero-shot image classification, zero-shot image-text retrieval and image object detection, verify the effectiveness of the proposed PyramidCLIP.
In particular, with the same amount of pre-training data (15 million image-text pairs), PyramidCLIP exceeds CLIP by 19.2%/18.5%/19.6% in image classification top-1 accuracy with the image encoder being ResNet-50/ViT-B32/ViT-B16, respectively.
Authors
Yuting Gao, Jinfeng Liu, Zihan Xu, Jun Zhang, Ke Li, Chunhua Shen