On the Harmful Effects of Implicit Regularization in Deep Reinforcement Learning
DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization
We discuss how the implicit regularization effect of stochastic gradient descent seen in supervised learning could in fact be harmful in the offline deep reinforcement learning setting, leading to poor generalization and degenerate feature representations.
Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the derived regularizer favors degenerate solutions with excessive "aliasing", in stark contrast to the supervised learning case.
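Concretely, the form of aliasing at issue can be quantified by the average dot product between the learned (penultimate-layer) features of the two state-action pairs appearing in a Bellman backup, a quantity the paper refers to as feature co-adaptation. As a sketch, with $\phi(s,a)$ denoting the features of the Q-network and $a'$ the action used for bootstrapping at the next state (notation chosen here for exposition):

$$\text{co-adaptation}(\phi) \;=\; \mathbb{E}_{(s,a,s') \sim \mathcal{D},\; a' \sim \pi(\cdot \mid s')}\!\left[\phi(s,a)^\top \phi(s',a')\right].$$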
We back up these findings empirically, showing that feature representations learned by a deep network value function trained via bootstrapping can indeed become degenerate, aliasing the representations for state-action pairs that appear on either side of the Bellman backup.
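To make this measurement concrete, the following is a minimal diagnostic sketch in PyTorch-style code; the `q_network.features` interface, the batch layout, and the policy call are assumptions for illustration, not the authors' implementation.

```python
import torch

def feature_coadaptation(q_network, policy, batch):
    """Average dot product between features of (s, a) and (s', a').

    Assumes a hypothetical `q_network.features(s, a)` that returns the
    penultimate-layer representation phi(s, a) with shape [batch, d].
    """
    s, a, s_next = batch["obs"], batch["action"], batch["next_obs"]
    with torch.no_grad():
        a_next = policy(s_next)                        # bootstrapped action a'
        phi_sa = q_network.features(s, a)              # phi(s, a)
        phi_next = q_network.features(s_next, a_next)  # phi(s', a')
    # Mean over the batch of phi(s, a)^T phi(s', a'): large values indicate
    # that the two sides of the backup share highly aligned representations.
    return (phi_sa * phi_next).sum(dim=-1).mean()
```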
To address this issue, we derive the form of this implicit regularizer and, inspired by this derivation, propose a simple and effective explicit regularizer, called DR3, that counteracts the undesirable effects of this implicit regularizer.
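As an illustration of how such an explicit penalty can be attached to a standard temporal-difference loss, here is a minimal sketch; the `q_value`/`features` interfaces, the target-network setup, and the coefficient `c0` are placeholders for exposition, not the released DR3 code.

```python
import torch
import torch.nn.functional as F

def td_loss_with_dr3(q_network, target_network, policy, batch,
                     gamma=0.99, c0=0.1):
    """TD loss plus an explicit penalty on the feature dot product of the
    state-action pairs on either side of the Bellman backup.

    `q_value(s, a)` and `features(s, a)` are assumed interfaces; `c0` is a
    placeholder coefficient to be tuned, not a value taken from the paper.
    """
    s, a, r, s_next, done = (batch[k] for k in
                             ("obs", "action", "reward", "next_obs", "done"))
    a_next = policy(s_next)

    # Standard bootstrapped TD target computed from a frozen target network.
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_network.q_value(s_next, a_next)
    td = F.mse_loss(q_network.q_value(s, a), target)

    # Explicit regularizer: discourage phi(s, a) and phi(s', a') from aligning.
    phi_sa = q_network.features(s, a)
    phi_next = q_network.features(s_next, a_next)
    dr3_penalty = (phi_sa * phi_next).sum(dim=-1).mean()

    return td + c0 * dr3_penalty
```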
When combined with existing offline deep reinforcement learning methods, DR3 substantially improves performance and stability, alleviating unlearning in Atari 2600 games, D4RL domains, and robotic manipulation from images.