VICRegL: Self-Supervised Learning of Local Visual Features
Most recent self-supervised methods for learning image representations focus
on either producing a global feature with invariance properties, or producing a
set of local features. The former works best for classification tasks while the
latter is best for detection and segmentation tasks. This paper explores the
fundamental trade-off between learning local and global features. A new method
called VICRegL is proposed that learns good global and local features
simultaneously, yielding excellent performance on detection and segmentation
tasks while maintaining good performance on classification tasks. Concretely,
two identical branches of a standard convolutional net architecture are fed two
differently distorted versions of the same image. The VICReg criterion is
applied to pairs of global feature vectors. Simultaneously, the VICReg
criterion is applied to pairs of local feature vectors occurring before the
last pooling layer. Two local feature vectors are attracted to each other if
their ℓ2-distance is below a threshold, or if their relative locations are
consistent with a known geometric transformation between the two input images.
We demonstrate strong performance on linear classification and segmentation
transfer tasks. Code and pretrained models are publicly available at:
this https URL
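The local matching rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the brute-force pairwise loop, and both threshold parameters are assumptions for clarity. It pairs two grids of local feature vectors when either their feature-space ℓ2-distance is small, or their spatial locations (already mapped into a common frame via the known geometric transform between the two crops) are close.

```python
import numpy as np

def match_local_features(f1, f2, coords1, coords2,
                         dist_thresh=1.0, loc_thresh=1.0):
    """Hypothetical sketch of the two matching criteria.

    f1, f2: (N, D) arrays of local feature vectors from the two branches.
    coords1, coords2: (N, 2) spatial locations of each feature, expressed
    in a common frame using the known transform between the two views.

    A pair (i, j) is attracted if EITHER the feature l2-distance is below
    dist_thresh OR the mapped locations are within loc_thresh.
    """
    pairs = []
    for i in range(len(f1)):
        for j in range(len(f2)):
            feat_close = np.linalg.norm(f1[i] - f2[j]) < dist_thresh
            loc_close = np.linalg.norm(coords1[i] - coords2[j]) < loc_thresh
            if feat_close or loc_close:
                pairs.append((i, j))
    return pairs
```

In the paper, the VICReg criterion would then be applied to each attracted pair; here the function only returns the candidate pairs, leaving the loss computation out of scope.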