The best-arm identification problem in the multi-armed bandit setting is an
excellent model of many real-world decision-making problems, yet it fails to
capture the fact that in the real world, safety constraints often must be met
while learning. In this work we study the question of best-arm identification
in safety-critical settings, where the agent's goal is to find the best safe
option out of many while exploring in a way that guarantees that certain,
initially unknown safety constraints are met. We first analyze this problem in
the setting where the reward and the safety constraint take a linear structure,
and show nearly matching upper and lower bounds. We then analyze a much more
general version of the problem, where we assume only that the reward and
safety constraint can be modeled by monotonic functions, and propose an
algorithm for this setting that is guaranteed to learn safely. We conclude
with experimental
results demonstrating the effectiveness of our approaches in scenarios such as
safely identifying the best drug out of many to treat an illness.