Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training.
Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions.
We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images.
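For concreteness, here is a minimal PyTorch sketch of this two-stage pipeline; the proposal source, the function name `classify_masked_regions`, and the zero-out-and-resize handling are our illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _preprocess = clip.load("ViT-B/16", device=device)

def classify_masked_regions(image, masks, class_names):
    """Stage 2: score class-agnostic mask proposals against text prompts.

    image: (3, H, W) float tensor; masks: (N, H, W) bool tensor from any
    class-agnostic proposal network (stage 1, e.g., a MaskFormer).
    CLIP's mean/std normalization is omitted here for brevity.
    """
    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        text_feat = F.normalize(model.encode_text(prompts), dim=-1)
        scores = []
        for m in masks:
            # Zero out everything outside the proposal; this "masked image"
            # is exactly the kind of input CLIP was never trained on.
            masked = (image * m.unsqueeze(0)).unsqueeze(0)
            masked = F.interpolate(masked, size=224, mode="bilinear")
            img_feat = F.normalize(model.encode_image(masked.to(device)), dim=-1)
            scores.append((img_feat @ text_feat.T).squeeze(0))
    return torch.stack(scores)  # (N, num_classes) similarity logits
```

Because every pixel outside a proposal is zeroed out, CLIP receives images far from its natural-image training distribution, which is the bottleneck the adaptation below targets.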
To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions, and we additionally propose a method we dub mask prompt tuning, which replaces the tokens of masked-out patches with learnable prompt tokens.
Experiments demonstrate that mask prompt tuning brings significant improvement without modifying any weights of the pre-trained model, and it can further improve a fully finetuned model.
For the first time, open-vocabulary generalist models match the performance of supervised specialist models from 2017 without dataset-specific adaptations.
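To make mask prompt tuning concrete, below is a minimal sketch of the idea as we read it: the tokens of fully masked-out ("blank") patches are swapped for learnable prompt tokens before CLIP's transformer blocks. Shapes, names, and the exact insertion point are assumptions; CLIP itself stays frozen.

```python
import torch
import torch.nn as nn

class MaskPromptTuning(nn.Module):
    """Replace patch tokens that fall in masked-out areas with learnable
    prompt tokens. Only these prompts receive gradients; the pre-trained
    CLIP weights are left untouched."""

    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.empty(num_patches, dim))
        nn.init.normal_(self.prompts, std=0.02)

    def forward(self, patch_tokens, blank_mask):
        # patch_tokens: (B, N, D) embeddings from CLIP's (frozen) patch stem.
        # blank_mask:   (B, N) bool, True where a patch is entirely masked out.
        w = blank_mask.unsqueeze(-1).float()            # (B, N, 1)
        return patch_tokens * (1.0 - w) + self.prompts * w
```

Since only `self.prompts` is optimized, this is consistent with the claim above that the improvement comes without modifying any weights of the pre-trained model.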
Authors
Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, Diana Marculescu