Universal audio representations that can solve diverse speech, music, and
environmental sound tasks could enable many applications that require
general sound content understanding. In this work, we introduce a holistic
audio representation evaluation suite (HARES) spanning 12 downstream tasks
across audio domains and provide a thorough empirical study of recent sound
representation learning systems on that benchmark. We find that existing
sound event classification and speech models do not generalize outside their
own domains. We observe that more robust audio representations can be learned with
the SimCLR objective; however, the model's transferability depends heavily on
the model architecture. We find that the Slowfast architecture learns the
rich representations required by different domains, but its performance is
sensitive to the normalization scheme. Based on these findings, we propose a
novel normalizer-free Slowfast NFNet and achieve state-of-the-art performance
across all domains.
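
For readers unfamiliar with the SimCLR objective mentioned above, the following is a minimal sketch of its contrastive loss (NT-Xent) in plain Python. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the temperature value, and the toy 2-D embeddings are all hypothetical, and the step that produces two views of each audio clip (augmentation plus encoder) is assumed to have already run.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nt_xent_loss(view_a, view_b, temperature=0.1):
    """NT-Xent loss for paired embeddings, where view_a[i] and view_b[i]
    are assumed to come from two views of the same clip.

    The temperature default is an illustrative choice, not a value
    taken from the paper.
    """
    n = len(view_a)
    z = [l2_normalize(v) for v in view_a + view_b]  # 2n unit vectors
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same clip
        # Temperature-scaled cosine similarity to every other embedding;
        # the anchor itself is excluded from the softmax.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(2 * n)
            if j != i
        }
        # Numerically stable log-sum-exp over the candidate set.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in logits.values()))
        # Cross-entropy with the positive pair as the target class.
        total += log_denom - logits[pos]
    return total / (2 * n)


# Toy embeddings: two clips, two views each.
clips_view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_view_b = [[1.0, 0.0], [0.0, 1.0]]   # each view matches its own clip
swapped_view_b = [[0.0, 1.0], [1.0, 0.0]]   # each view matches the wrong clip

loss_aligned = nt_xent_loss(clips_view_a, aligned_view_b)
loss_swapped = nt_xent_loss(clips_view_a, swapped_view_b)
```

In this toy setup the loss is lower when the two views of a clip map to nearby embeddings than when they are swapped, which is the behavior the objective rewards during training.
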
Authors
Luyu Wang, Pauline Luc, Yan Wu, Adria Recasens, Lucas Smaira, Andrew Brock, Andrew Jaegle, Jean-Baptiste Alayrac, Sander Dieleman, Joao Carreira, Aaron van den Oord