We present a framework for efficient perceptual inference that explicitly
reasons about the segmentation of its inputs and features. Rather than being
trained for any specific segmentation, our framework learns the grouping
process in an unsupervised manner or alongside any supervised task. By
enriching the representations of a neural network, we enable it to group the
representations of different objects in an iterative manner. By allowing the
system to amortize the iterative inference of the groupings, we achieve very
fast convergence. In contrast to many other recently proposed methods for
addressing multi-object scenes, our system does not assume the inputs to be
images and can therefore directly handle other modalities. For multi-digit
classification of very cluttered images that require texture segmentation, our
method offers improved classification performance over convolutional networks
despite being fully connected. Furthermore, we observe that our system greatly
improves on the semi-supervised result of a baseline Ladder network on our
dataset, indicating that segmentation can also improve sample efficiency.

Authors
Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hotloo Hao, Jürgen Schmidhuber, Harri Valpola