ViCE: Self-Supervised Visual Concept Embeddings as Contextual and Pixel Appearance Invariant Semantic Representations
This work presents a self-supervised method for learning dense, semantically rich
visual concept embeddings for images, inspired by methods for learning word
embeddings in NLP. Our method improves on prior work by generating more
expressive embeddings and by being applicable to high-resolution images.
Viewing the generation of natural images as a stochastic process in which a set
of latent visual concepts gives rise to observable pixel appearances, our method
is formulated to learn the inverse mapping from pixels to concepts. Our method
greatly improves the effectiveness of self-supervised learning for dense
embedding maps by introducing superpixelization as a natural hierarchical step
up from pixels to a small set of visually coherent regions. Additional
contributions include regional contextual masking with nonuniform shapes that
match visually coherent patches, and complexity-based view sampling inspired by
masked language models. The enhanced expressiveness of our dense embeddings is
demonstrated by significant improvements over the state of the art on
representation quality benchmarks on COCO (+12.94 mIoU, +87.6\%) and Cityscapes
(+16.52 mIoU, +134.2\%). The results also show favorable scaling and domain
generalization properties not demonstrated by prior work.
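To make the superpixelization step concrete, the following is a minimal sketch, not the authors' implementation: it computes SLIC superpixels with scikit-image and averages a hypothetical dense per-pixel embedding map within each visually coherent region, yielding a small set of region-level embeddings. The function name superpixel_pool, the dense_embeddings input, and all parameter values are illustrative assumptions.

\begin{verbatim}
import numpy as np
from skimage.segmentation import slic

def superpixel_pool(image, dense_embeddings, n_segments=256):
    """Average dense per-pixel embeddings within SLIC superpixels.

    image:            (H, W, 3) RGB image used to compute superpixels.
    dense_embeddings: (H, W, D) per-pixel embedding map from any dense encoder
                      (hypothetical input, not the authors' model).
    Returns (labels, region_embeddings): labels is (H, W) with region ids,
    region_embeddings is (n_regions, D).
    """
    # Partition the image into visually coherent regions.
    labels = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    n_regions = labels.max() + 1
    flat_labels = labels.ravel()
    counts = np.bincount(flat_labels, minlength=n_regions)

    # Mean-pool the embedding map over each superpixel, channel by channel.
    d = dense_embeddings.shape[-1]
    region_embeddings = np.empty((n_regions, d), dtype=np.float64)
    for c in range(d):
        sums = np.bincount(flat_labels,
                           weights=dense_embeddings[..., c].ravel(),
                           minlength=n_regions)
        region_embeddings[:, c] = sums / np.maximum(counts, 1)
    return labels, region_embeddings
\end{verbatim}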