How to Leverage Unlabeled Data in Offline Reinforcement Learning
Offline reinforcement learning (RL) can learn control policies from static
datasets but, like standard RL methods, it requires reward annotations for
every transition. In many cases, labeling large datasets with rewards may be
costly, especially if those rewards must be provided by human labelers, while
collecting diverse unlabeled data might be comparatively inexpensive. How can
we best leverage such unlabeled data in offline RL? One natural solution is to
learn a reward function from the labeled data and use it to label the unlabeled
data. In this paper, we find that, perhaps surprisingly, a much simpler method
that applies zero rewards to the unlabeled data leads to effective data sharing,
both in theory and in practice, without learning any reward model at
all. While this approach might seem strange (and incorrect) at first, we
provide extensive theoretical and empirical analysis that illustrates how it
trades off reward bias, sample complexity, and distributional shift, often
leading to good results. We characterize conditions under which this simple
strategy is effective, and show that extending it with a simple
reweighting approach can further alleviate the bias introduced by using
incorrect reward labels. Our empirical evaluation confirms these findings in
simulated robotic locomotion, navigation, and manipulation settings.
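To make the core idea concrete, the sketch below shows one way zero-reward data sharing could look in code. It is an illustration only, not the paper's implementation: the transition format, the function and variable names, and the toy usage at the bottom are all assumptions made for this example.

```python
# Minimal sketch of zero-reward data sharing for offline RL (illustrative only).
# Assumes transitions are (observation, action, reward, next_observation, done)
# tuples; `labeled_data`, `unlabeled_data`, and the toy data below are
# hypothetical placeholders, not the paper's actual data pipeline.

import numpy as np


def share_with_zero_rewards(labeled_data, unlabeled_data):
    """Merge a reward-labeled dataset with unlabeled transitions by assigning
    every unlabeled transition a reward of 0, instead of fitting a reward model."""
    relabeled = [(s, a, 0.0, s_next, done)
                 for (s, a, _missing_reward, s_next, done) in unlabeled_data]
    return labeled_data + relabeled


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    def random_transition(reward):
        return (rng.normal(size=3), rng.normal(size=1), reward,
                rng.normal(size=3), False)

    labeled = [random_transition(reward=1.0) for _ in range(5)]
    unlabeled = [random_transition(reward=None) for _ in range(8)]

    combined = share_with_zero_rewards(labeled, unlabeled)
    # The combined dataset would then be handed to any standard offline RL
    # algorithm; the zero labels introduce reward bias but add state-action
    # coverage, which is the trade-off the paper analyzes.
    print(f"{len(combined)} transitions in the shared dataset")
```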
Authors
Tianhe Yu, Aviral Kumar, Yevgen Chebotar, Karol Hausman, Chelsea Finn, Sergey Levine