Adversarially Robust Multi-Armed Bandit Algorithm with Variance-Dependent Regret Bounds
This paper considers the multi-armed bandit (MAB) problem and provides a new
best-of-both-worlds (BOBW) algorithm that works nearly optimally in both
stochastic and adversarial settings. In stochastic settings, some existing BOBW
algorithms achieve tight gap-dependent regret bounds of $O(\sum_{i: \Delta_i>0}
\frac{\log T}{\Delta_i})$ for suboptimality gap $\Delta_i$ of arm $i$ and time
horizon $T$. As Audibert et al. [2007] have shown, however, the performance
can be improved in stochastic environments with low-variance arms.
In fact, they have provided a stochastic MAB algorithm with
gap-variance-dependent regret bounds of $O(\sum_{i: \Delta_i>0}
(\frac{\sigma_i^2}{\Delta_i} + 1) \log T )$ for loss variance $\sigma_i^2$ of
arm $i$. In this paper, we propose the first BOBW algorithm with
gap-variance-dependent bounds, showing that variance information can be
exploited even in possibly adversarial environments. Further, the leading
constant factor in our gap-variance-dependent bound is only (almost) twice the
value of the lower bound. Additionally, the proposed algorithm enjoys multiple
data-dependent regret bounds in adversarial settings and works well in
stochastic settings with adversarial corruptions. The proposed algorithm is
based on the follow-the-regularized-leader method and employs adaptive learning
rates that depend on the empirical prediction error of the loss, which leads to
gap-variance-dependent regret bounds reflecting the variance of the arms.
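To make this mechanism concrete, below is a minimal Python sketch of an FTRL-style bandit that combines importance-weighted loss estimates built around per-arm loss predictions with a learning rate that shrinks as the accumulated squared prediction error grows. The 1/2-Tsallis-entropy regularizer, the learning-rate schedule $\eta_t = 1/\sqrt{1 + \sum_{s<t}(\ell_s - m_s)^2}$, and the running-mean predictions are illustrative assumptions, not the exact algorithm or analysis of the paper.

```python
import numpy as np

def tsallis_weights(cum_loss, eta, iters=60):
    """FTRL step with a 1/2-Tsallis-entropy regularizer:
    p_i = 1 / (eta * (L_i + lam))^2, where lam is found by bisection so that
    the probabilities sum to one (the sum is decreasing in lam)."""
    lo = -cum_loss.min() + 1e-12                 # sum(p) very large here
    hi = -cum_loss.min() + len(cum_loss) / eta   # sum(p) <= 1 here
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        p = 1.0 / (eta * (cum_loss + lam)) ** 2
        if p.sum() > 1.0:
            lo = lam
        else:
            hi = lam
    p = 1.0 / (eta * (cum_loss + lam)) ** 2
    return p / p.sum()

def ftrl_bandit(loss_fn, n_arms, horizon, rng=None):
    """Illustrative FTRL bandit: optimistic importance-weighted loss estimates
    around running-mean predictions, with a learning rate that decays with the
    accumulated squared prediction error, so arms whose losses are predicted
    accurately (low variance) slow the learning-rate decay."""
    rng = np.random.default_rng() if rng is None else rng
    cum_loss = np.zeros(n_arms)   # cumulative estimated losses
    pred = np.zeros(n_arms)       # per-arm loss predictions (running means)
    pulls = np.zeros(n_arms)
    sq_err = 1.0                  # 1 + accumulated squared prediction error

    for t in range(1, horizon + 1):
        eta = 1.0 / np.sqrt(sq_err)          # adaptive learning rate (assumed form)
        p = tsallis_weights(cum_loss, eta)
        arm = rng.choice(n_arms, p=p)
        loss = loss_fn(arm, t)               # observed loss in [0, 1]

        # unbiased optimistic estimator: its variance scales with (loss - pred)^2
        est = pred + np.eye(n_arms)[arm] * (loss - pred[arm]) / p[arm]
        cum_loss += est

        sq_err += (loss - pred[arm]) ** 2
        pulls[arm] += 1
        pred[arm] += (loss - pred[arm]) / pulls[arm]
    return pred, pulls

# Usage sketch: three Bernoulli-loss arms with means 0.9, 0.5, 0.1.
rng = np.random.default_rng(0)
means = np.array([0.9, 0.5, 0.1])
loss_fn = lambda arm, t: float(rng.random() < means[arm])
pred, pulls = ftrl_bandit(loss_fn, n_arms=3, horizon=10_000, rng=rng)
```

In this sketch, rounds in which the observed loss matches the prediction contribute little to the accumulated squared error, so the learning rate stays larger and the estimated losses are trusted sooner; this is the informal sense in which low-variance arms can yield smaller regret.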