Object-centric representations are a promising path toward more systematic
generalization by providing flexible abstractions upon which compositional
world models can be built. Recent work on simple 2D and 3D datasets has shown
that models with object-centric inductive biases can learn to segment and
represent meaningful objects from the statistical structure of the data alone
without the need for any supervision. However, such fully-unsupervised methods
still fail to scale to diverse realistic data, despite the use of increasingly
complex inductive biases such as priors for the size of objects or the 3D
geometry of the scene. In this paper, we instead take a weakly-supervised
approach and focus on how 1) using the temporal dynamics of video data in the
form of optical flow and 2) conditioning the model on simple object location
cues can be used to enable segmenting and tracking objects in significantly
more realistic synthetic data. We introduce a sequential extension to Slot
Attention which we train to predict optical flow for realistic looking
synthetic scenes and show that conditioning the initial state of this model on
a small set of hints, such as center of mass of objects in the first frame, is
sufficient to significantly improve instance segmentation. These benefits
generalize beyond the training distribution to novel objects, novel
backgrounds, and to longer video sequences. We also find that such
initial-state-conditioning can be used during inference as a flexible interface
to query the model for specific objects or parts of objects, which could pave
the way for a range of weakly-supervised approaches and allow more effective
interaction with trained models.
Authors
Thomas Kipf, Gamaleldin F. Elsayed, Aravindh Mahendran, Austin Stone, Sara Sabour, Georg Heigold, Rico Jonschkowski, Alexey Dosovitskiy, Klaus Greff