Zero-Shot Detection via Vision and Language Knowledge Distillation
Zero-shot image classification has made promising progress by training
aligned image and text encoders. The goal of this work is to advance zero-shot
object detection, which aims to detect novel objects without bounding box or
mask annotations. We propose ViLD, a training method via Vision and Language
knowledge Distillation. We distill the knowledge from a pre-trained zero-shot
image classification model (e.g., CLIP) into a two-stage detector (e.g., Mask
R-CNN). Our method aligns the region embeddings in the detector to the text and
image embeddings inferred by the pre-trained model. We use the text embeddings,
obtained by feeding category names into the pre-trained text encoder, as the
detection classifier. We then minimize the distance between the region
embeddings and the image embeddings obtained by feeding region proposals into
the pre-trained image encoder. During inference, we include the text embeddings
of novel categories in the detection classifier for zero-shot detection. We
benchmark the performance on the LVIS dataset by holding out all rare categories as
novel categories. ViLD obtains 16.1 mask AP$_r$ with a Mask R-CNN (ResNet-50
FPN) for zero-shot detection, outperforming the supervised counterpart by 3.8 AP$_r$.
The model can directly transfer to other datasets, achieving 72.2 AP$_{50}$,
36.6 AP and 11.8 AP on PASCAL VOC, COCO and Objects365, respectively.
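
For concreteness, below is a minimal PyTorch sketch of the two training objectives and the open-vocabulary inference step described above. The function names, the cosine-similarity logits with a temperature, the learned background embedding, and the L1 distance are illustrative assumptions based on the description in the abstract, not necessarily the exact implementation.

```python
import torch
import torch.nn.functional as F


def vild_text_loss(region_emb, text_emb, bg_emb, labels, temperature=0.01):
    """Classify region embeddings against frozen text embeddings of base categories.

    region_emb: (N, D) region embeddings from the detector head.
    text_emb:   (C, D) text embeddings of base-category names (frozen).
    bg_emb:     (D,)  learned background embedding (assumed detail).
    labels:     (N,)  ground-truth indices in [0, C]; 0 denotes background.
    """
    classifier = torch.cat([bg_emb.unsqueeze(0), text_emb], dim=0)        # (C+1, D)
    # Cosine-similarity logits with a temperature (assumed details).
    logits = F.normalize(region_emb, dim=-1) @ F.normalize(classifier, dim=-1).t()
    return F.cross_entropy(logits / temperature, labels)


def vild_image_loss(region_emb, clip_image_emb):
    """Align region embeddings with the image embeddings of cropped proposals
    computed by the pre-trained image encoder (L1 distance assumed)."""
    return F.l1_loss(F.normalize(region_emb, dim=-1),
                     F.normalize(clip_image_emb, dim=-1))


def open_vocab_scores(region_emb, base_text_emb, novel_text_emb, bg_emb,
                      temperature=0.01):
    """At inference, include text embeddings of novel categories in the classifier."""
    classifier = torch.cat([bg_emb.unsqueeze(0), base_text_emb, novel_text_emb], dim=0)
    logits = F.normalize(region_emb, dim=-1) @ F.normalize(classifier, dim=-1).t()
    return (logits / temperature).softmax(dim=-1)
```

Here `clip_image_emb` stands for the image embeddings obtained by feeding cropped region proposals into the pre-trained image encoder. At inference, text embeddings of novel categories are simply concatenated to the classifier, so no detector weights need to be retrained to recognize unseen categories.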