We show that self-supervised visual pre-training from real-world images is effective for learning motor control tasks from pixels.
Without relying on labels, state estimation, or expert demonstrations, we consistently outperform supervised encoders, by up to 80% in absolute success rate, sometimes even matching oracle state performance.
To accelerate progress in learning from pixels, we contribute a benchmark suite of hand-designed tasks varying in movements, scenes, and robots.
Authors
Tete Xiao, Ilija Radosavovic, Trevor Darrell, Jitendra Malik