To build general robotic agents that can operate in many environments, it is
often imperative for the robot to collect experience in the real world.
However, collecting such experience is often infeasible due to safety, time, and
hardware restrictions. We thus propose leveraging the next best thing to real-world
experience: internet videos of humans using their hands. Visual priors, such as
visual features, are often learned from videos, but we believe that more
information from videos can be utilized as a stronger prior. We build a
learning algorithm, VideoDex, that leverages visual, action, and physical
priors from human video datasets to guide robot behavior. These action and
physical priors in the neural network dictate the typical human behavior for a
particular robot task. We test our approach on a robot arm and dexterous
hand-based system and show strong results on various manipulation tasks,
outperforming various state-of-the-art methods.