Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification
The idea of conservatism has led to significant progress in offline
reinforcement learning (RL) where an agent learns from pre-collected datasets.
However, extending offline RL to the more practical multi-agent setting remains
an open question, as many real-world scenarios involve interactions among
multiple agents. Given the recent success of transferring online RL algorithms
to the multi-agent setting, one might expect offline RL algorithms to transfer
directly as well. Surprisingly,
when conservatism-based algorithms are applied to the multi-agent setting, their
performance degrades significantly as the number of agents increases. Toward
mitigating this degradation, we identify a key issue: the landscape of the value
function can be non-concave, so policy gradient updates are prone to local
optima. Multiple agents exacerbate the problem, since a suboptimal policy from
any single agent can lead to uncoordinated global failure.
Following this intuition, we propose a simple yet effective method, Offline
Multi-Agent RL with Actor Rectification (OMAR), which tackles this challenge by
combining first-order policy gradients with zeroth-order optimization for the
actor so that it better optimizes the conservative value function. Despite its
simplicity, OMAR significantly outperforms strong baselines and achieves
state-of-the-art performance on multi-agent continuous control benchmarks.
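To make the combination of first-order and zeroth-order optimization concrete, the sketch below illustrates one way an actor-rectification loss of this kind can be written in PyTorch: a CEM-style sampling search proposes actions with higher (conservative) Q-values, and the actor is regularized toward the best proposal alongside the usual policy-gradient term. All names (`zeroth_order_action_search`, `omar_style_actor_loss`, the coefficient `coef`, sampling hyperparameters, and the toy actor/critic) are illustrative assumptions, not the paper's actual code or API.

```python
# Minimal sketch (assumed, not the authors' implementation) of actor rectification:
# combine a first-order policy-gradient term with a zeroth-order search target.
import torch


def zeroth_order_action_search(critic, obs, init_action, num_iters=3,
                               num_samples=16, elite_frac=0.25, init_sigma=0.3):
    """CEM-style search for actions with higher Q, seeded by the actor's output."""
    with torch.no_grad():
        mean = init_action.detach()                                   # [B, act_dim]
        std = torch.full_like(mean, init_sigma)
        num_elites = max(2, int(elite_frac * num_samples))
        best_action = mean.clone()
        best_q = critic(obs, best_action).squeeze(-1)                 # [B]
        for _ in range(num_iters):
            # Sample candidate actions around the current Gaussian proposal.
            noise = torch.randn(num_samples, *mean.shape)             # [K, B, act_dim]
            candidates = (mean.unsqueeze(0) + std.unsqueeze(0) * noise).clamp(-1.0, 1.0)
            flat_actions = candidates.reshape(-1, mean.shape[-1])
            obs_rep = obs.unsqueeze(0).expand(num_samples, -1, -1).reshape(-1, obs.shape[-1])
            q = critic(obs_rep, flat_actions).reshape(num_samples, -1)  # [K, B]
            # Keep the best candidate per state seen so far.
            top_q, top_idx = q.max(dim=0)
            chosen = candidates[top_idx, torch.arange(mean.shape[0])]
            improved = top_q > best_q
            best_action = torch.where(improved.unsqueeze(-1), chosen, best_action)
            best_q = torch.where(improved, top_q, best_q)
            # Refit the proposal distribution to the elite candidates.
            elite_idx = q.topk(num_elites, dim=0).indices             # [E, B]
            elites = candidates.gather(
                0, elite_idx.unsqueeze(-1).expand(-1, -1, mean.shape[-1]))
            mean, std = elites.mean(dim=0), elites.std(dim=0) + 1e-6
    return best_action


def omar_style_actor_loss(actor, critic, obs, coef=0.7):
    """Blend the policy-gradient objective with regression toward the searched action."""
    pi_action = actor(obs)
    q_term = critic(obs, pi_action).mean()                            # first-order term
    target_action = zeroth_order_action_search(critic, obs, pi_action)
    mse_term = ((pi_action - target_action) ** 2).mean()              # rectification term
    return -(1.0 - coef) * q_term + coef * mse_term


# Toy usage (shapes only): quadratic critic peaked at a = 0.5, tiny tanh actor.
torch.manual_seed(0)
obs = torch.randn(8, 4)
critic = lambda o, a: -((a - 0.5) ** 2).sum(-1, keepdim=True)
actor = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.Tanh())
loss = omar_style_actor_loss(actor, critic, obs)
loss.backward()
```

In a multi-agent setup, a loss of this form would be applied per agent against its own conservative critic; the coefficient trades off how strongly the actor is pulled toward the sampled target versus following the raw policy gradient.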