A 3D Convolutional Approach to Spectral Object Segmentation in Space and Time
Elena Burceanu, Marius Leordeanu
We formulate object segmentation in video as a graph partitioning problem in
space and time, in which nodes are pixels and their relations form local
neighborhoods. We claim that the strongest cluster in this pixel-level graph
corresponds to the segmentation of the salient object. We compute the main
cluster using a novel and fast 3D filtering technique that finds the spectral
clustering solution, namely the principal eigenvector of the graph's adjacency
matrix, without explicitly building the matrix, which would be intractable. Our method
is based on the power iteration for finding the principal eigenvector of a
matrix, which we prove is equivalent to performing a specific set of 3D
convolutions in the space-time feature volume. This allows us to avoid building
the matrix and to obtain a fast parallel implementation on the GPU. We show that our
method is much faster than classical power iteration applied directly on the
adjacency matrix. Unlike other works, ours is dedicated to preserving
object consistency in space and time at the pixel level. For that, it
requires powerful pixel-wise features at the frame level. This makes it
well suited for taking the output of a backbone network or of other
methods as input and quickly improving over their solutions without supervision. In
experiments, we obtain consistent improvements, with the same set of
hyper-parameters, over top state-of-the-art methods on the DAVIS-2016 dataset,
in both the unsupervised and the semi-supervised tasks. We also achieve top
results on the well-known SegTrackv2 dataset.
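
To make the power-iteration-as-3D-convolution idea concrete, the following is a minimal sketch, not the paper's exact formulation. It assumes a simplified, separable affinity M_ij = s_i * k(i, j) * s_j over a local space-time neighborhood, where s is a per-pixel saliency volume; the function name power_iteration_3d, the uniform kernel, and the hyper-parameter values are illustrative choices, written here in PyTorch.

```python
import torch
import torch.nn.functional as F

def power_iteration_3d(s, num_iters=5, kernel_size=5):
    """Power iteration on a space-time pixel affinity matrix, computed
    implicitly through 3D convolutions.

    Assumed (illustrative) affinity: M_ij = s_i * k(i, j) * s_j, where
    s is a per-pixel saliency volume of shape (T, H, W) and k is a fixed
    uniform kernel over a local space-time neighborhood. Then
    (M x)_i = s_i * conv3d(s * x)_i, so M is never materialized.
    """
    T, H, W = s.shape
    pad = kernel_size // 2
    # Uniform local space-time neighborhood; a Gaussian kernel would also work.
    k = torch.ones(1, 1, kernel_size, kernel_size, kernel_size, dtype=s.dtype)

    x = torch.ones_like(s)                          # initial eigenvector estimate
    for _ in range(num_iters):
        sx = (s * x).view(1, 1, T, H, W)            # element-wise weighting by s
        mx = s * F.conv3d(sx, k, padding=pad).view(T, H, W)  # implicit M @ x
        x = mx / mx.norm()                          # re-normalize every iteration
    return x                                        # principal eigenvector as a 3D soft mask

# Usage: s could be per-frame soft masks produced by a backbone network.
# soft_seg = power_iteration_3d(torch.rand(8, 64, 64))
```

Because each matrix-vector product reduces to one 3D convolution plus element-wise operations, memory stays on the order of the number of pixels (THW) rather than the size of the full adjacency matrix (THW x THW), which is what makes the spectral solution tractable on GPU.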