Adversarial Policies Beat Professional-Level Go AIs
We attack the state-of-the-art Go-playing AI system, KataGo, by training an
adversarial policy that plays against a frozen KataGo victim. Our attack
achieves a >99% win-rate against KataGo without search, and a >50% win-rate
when KataGo uses enough search to be near-superhuman. To the best of our
knowledge, this is the first successful end-to-end attack against a Go AI
playing at the level of a top human professional. Notably, the adversary does
not win by learning to play Go better than KataGo -- in fact, the adversary is
easily beaten by human amateurs. Instead, the adversary wins by tricking KataGo
into ending the game prematurely at a point that is favorable to the adversary.
Our results demonstrate that even professional-level AI systems may harbor
surprising failure modes. See this https URL for example
games.
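
The core setup, training an adversarial policy against a frozen victim, can be illustrated with a minimal sketch. The sketch below is not the authors' code: it assumes PyTorch, substitutes a tiny four-action game for Go, and uses plain REINFORCE in place of the paper's actual training pipeline. All names (PolicyNet, play_episode, the toy win condition) are illustrative assumptions.

```python
# Minimal sketch of adversarial-policy training against a frozen victim.
# Assumptions: PyTorch, a toy 4-action game in place of Go, and REINFORCE
# in place of the paper's real pipeline. Names here are illustrative only.
import torch
import torch.nn as nn

N_ACTIONS = 4  # toy action space standing in for Go moves

class PolicyNet(nn.Module):
    """Tiny policy network mapping an observation to a move distribution."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_ACTIONS, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS)
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

victim = PolicyNet()
victim.requires_grad_(False)  # frozen: the victim is never updated
victim.eval()

adversary = PolicyNet()
optimizer = torch.optim.Adam(adversary.parameters(), lr=1e-2)

def play_episode():
    """Alternate adversary/victim moves for a few turns.

    Toy win condition (illustrative): the adversary wins (+1) if its
    final move differs from the victim's final move, else it loses (-1).
    """
    obs = torch.zeros(N_ACTIONS)
    log_probs, adv_move, victim_move = [], None, None
    for turn in range(4):
        if turn % 2 == 0:  # adversary to move; keep gradients
            dist = adversary(obs)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            adv_move = action
        else:  # frozen victim to move; no gradients needed
            with torch.no_grad():
                action = victim(obs).sample()
            victim_move = action
        obs = torch.zeros(N_ACTIONS)
        obs[action] = 1.0  # the last move becomes the next observation
    reward = 1.0 if adv_move.item() != victim_move.item() else -1.0
    return reward, torch.stack(log_probs).sum()

for step in range(500):
    reward, total_log_prob = play_episode()
    loss = -reward * total_log_prob  # REINFORCE: upweight winning episodes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Freezing the victim is what distinguishes this setup from ordinary self-play: only the adversary's parameters change, so it can profit from exploiting a fixed blind spot rather than from learning generally strong play, which is consistent with the abstract's observation that the adversary is easily beaten by human amateurs.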
Authors
Tony Tong Wang, Adam Gleave, Nora Belrose, Tom Tseng, Joseph Miller, Michael D. Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell