Unsupervised Learning of Visual 3D Keypoints for Control
Learning sensorimotor control policies from high-dimensional images crucially
relies on the quality of the underlying visual representations. Prior work
shows that structured latent spaces such as visual keypoints often outperform
unstructured representations for robotic control. However, most of these
representations, whether structured or unstructured, are learned in a 2D space,
even though control tasks are usually performed in a 3D environment. In
this work, we propose a framework to learn such a 3D geometric structure
directly from images in an end-to-end unsupervised manner. The input images are
embedded into latent 3D keypoints via a differentiable encoder, which is trained
to optimize both a multi-view consistency loss and the downstream task objective.
These discovered 3D keypoints tend to meaningfully capture robot joints as well
as object movements in a consistent manner across both time and 3D space. The
proposed approach outperforms prior state-of-the-art methods across a variety of
reinforcement learning benchmarks. Code and videos are available at
this https URL
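
To make the idea concrete, below is a minimal sketch of one plausible instantiation of such a pipeline, not the authors' released implementation: a convolutional encoder predicts per-keypoint heatmaps and depth maps, a soft-argmax turns them into differentiable (u, v, depth) triples, a pinhole model unprojects them into world coordinates using known camera poses, and a consistency term penalizes disagreement of each keypoint's 3D position across views. All names (`KeypointEncoder`, `unproject`, `multiview_consistency`) and simplifications (a single `focal` scalar, normalized image coordinates) are hypothetical assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointEncoder(nn.Module):
    """Hypothetical sketch: map an RGB image to K differentiable (u, v, depth) triples."""
    def __init__(self, k=8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(),
        )
        self.heat = nn.Conv2d(64, k, 1)   # one spatial heatmap per keypoint
        self.depth = nn.Conv2d(64, k, 1)  # one depth map per keypoint

    def forward(self, img):
        feat = self.backbone(img)                        # (B, 64, H, W)
        heat, depth = self.heat(feat), self.depth(feat)
        b, k, h, w = heat.shape
        # Soft-argmax: normalize each heatmap into a spatial distribution,
        # then take expected image coordinates and expected depth under it.
        prob = F.softmax(heat.flatten(2), dim=-1).view(b, k, h, w)
        xs = torch.linspace(-1, 1, w, device=img.device)
        ys = torch.linspace(-1, 1, h, device=img.device)
        u = (prob.sum(2) * xs).sum(-1)                   # expected x in [-1, 1]
        v = (prob.sum(3) * ys).sum(-1)                   # expected y in [-1, 1]
        d = (prob * depth).flatten(2).sum(-1)            # expected depth
        return torch.stack([u, v, d], dim=-1)            # (B, K, 3)

def unproject(uvd, cam_to_world, focal=1.0):
    """Pinhole unprojection of normalized (u, v, depth) into the world frame."""
    u, v, d = uvd.unbind(-1)
    pts_cam = torch.stack([u * d / focal, v * d / focal, d], dim=-1)
    homo = torch.cat([pts_cam, torch.ones_like(d).unsqueeze(-1)], dim=-1)
    return torch.einsum('bij,bkj->bki', cam_to_world, homo)[..., :3]

def multiview_consistency(world_pts_per_view):
    """Penalize variance of each keypoint's world position across camera views."""
    pts = torch.stack(world_pts_per_view, dim=0)         # (V, B, K, 3)
    return pts.var(dim=0, unbiased=False).mean()

# Usage sketch: encode two synchronized views, unproject with their poses,
# and add the consistency term to whatever RL objective trains the policy.
enc = KeypointEncoder(k=8)
img_a, img_b = torch.rand(2, 4, 3, 64, 64)               # two views, batch of 4
pose_a = torch.eye(4).expand(4, 4, 4)                    # placeholder camera poses
pose_b = torch.eye(4).expand(4, 4, 4)
pts_a = unproject(enc(img_a), pose_a)
pts_b = unproject(enc(img_b), pose_b)
loss = multiview_consistency([pts_a, pts_b])             # combine with RL loss
```

In this sketch the consistency loss is what ties the learned keypoints to a shared 3D frame: because gradients flow through the soft-argmax and the unprojection, the encoder is pushed toward keypoints whose world positions agree across cameras, while the downstream task objective (omitted here) pushes them toward control-relevant structure such as joints and objects.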