Self-supervised Learning from a Multi-view Perspective
As a subset of unsupervised representation learning, self-supervised
representation learning adopts self-defined signals as supervision and uses the
learned representation for downstream tasks, such as object detection and image
captioning. Many approaches proposed for self-supervised learning naturally
follow a multi-view perspective, where the input (e.g., original images) and
the self-supervised signals (e.g., augmented images) can be seen as two
redundant views of the data. Building from this multi-view perspective, this
paper provides an information-theoretical framework to better understand the
properties that encourage successful self-supervised learning. Specifically, we
demonstrate that self-supervised learned representations can extract
task-relevant information and discard task-irrelevant information. Our
theoretical framework paves the way to a larger space of self-supervised
learning objective design. In particular, we propose a composite objective that
bridges the gap between prior contrastive and predictive learning objectives,
and introduce an additional objective term to discard task-irrelevant
information. To verify our analysis, we conduct controlled experiments to
evaluate the impact of the composite objectives. We also explore our
framework's empirical generalization beyond the multi-view perspective, where
the cross-view redundancy may not be clearly observed.
Authors
Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, Louis-Philippe Morency
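
To make the idea of a composite objective concrete, the sketch below combines an InfoNCE-style contrastive term between two views' embeddings, a forward-predictive (reconstruction) term, and an inverse-predictive regularizer intended to discard information not shared across views. This is only an illustrative approximation under assumed choices, not the paper's exact formulation: the function name, the `decoder` and `inverse_predictor` modules, the mean-squared-error surrogates, and the `lambda_*` weights are all hypothetical.

```python
import torch
import torch.nn.functional as F

def composite_ssl_loss(z_x, z_s, s, decoder, inverse_predictor,
                       lambda_cl=1.0, lambda_fp=1.0, lambda_ip=0.1,
                       temperature=0.1):
    """Illustrative composite self-supervised objective (hypothetical):
      - contrastive (InfoNCE-style) term between the two views' embeddings,
      - forward-predictive term reconstructing the other view from z_x,
      - inverse-predictive term regressing z_x from z_s, which discourages
        z_x from keeping information absent in the other view.
    z_x, z_s: (batch, dim) embeddings of the input view and the self-supervised view.
    s: target for the forward-predictive head (e.g., the other view's pixels).
    """
    # Contrastive term: matching (z_x, z_s) pairs are positives,
    # all other in-batch pairs are negatives.
    z_x_n = F.normalize(z_x, dim=-1)
    z_s_n = F.normalize(z_s, dim=-1)
    logits = z_x_n @ z_s_n.t() / temperature              # (batch, batch)
    labels = torch.arange(z_x.size(0), device=z_x.device)
    loss_cl = F.cross_entropy(logits, labels)

    # Forward-predictive term: reconstruct the other view from z_x
    # (MSE used here as a simple surrogate for a log-likelihood term).
    s_hat = decoder(z_x)
    loss_fp = F.mse_loss(s_hat, s)

    # Inverse-predictive term: predict z_x from z_s; penalizing the residual
    # pushes z_x toward information recoverable from the other view,
    # i.e., toward discarding (approximately) task-irrelevant information.
    z_x_hat = inverse_predictor(z_s)
    loss_ip = F.mse_loss(z_x_hat, z_x.detach())

    return lambda_cl * loss_cl + lambda_fp * loss_fp + lambda_ip * loss_ip
```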