BEATs: Audio Pre-Training with Acoustic Tokenizers
Self-supervised learning (SSL) has grown rapidly in the language, vision,
speech, and audio domains over the past few years. While
discrete label prediction is widely adopted for other modalities, the
state-of-the-art audio SSL models still employ reconstruction loss for
pre-training. Compared with reconstruction loss, semantic-rich discrete label
prediction encourages the SSL model to abstract the high-level audio semantics
and discard the redundant details as in human perception. However, a
semantic-rich acoustic tokenizer for general audio pre-training is usually not
straightforward to obtain, because audio is continuous and, unlike speech,
has no phoneme sequences available. To tackle this challenge, we propose
BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder
representation from Audio Transformers, in which an acoustic tokenizer and an
audio SSL model are optimized iteratively. In the first iteration, we use
random projection as the acoustic tokenizer to train an audio SSL model with a
mask-and-label-prediction objective. Then, we train an acoustic tokenizer for the
next iteration by distilling the semantic knowledge from the pre-trained or
fine-tuned audio SSL model. The iteration is repeated with the aim that the
acoustic tokenizer and the audio SSL model improve each other. The experimental
results demonstrate our acoustic tokenizers can generate discrete labels with
rich audio semantics and our audio SSL models achieve state-of-the-art results
across various audio classification benchmarks, even significantly outperforming
previous models that use more training data and model parameters.
Specifically, we set a new state-of-the-art mAP of 50.6% on AudioSet-2M for
audio-only models without using any external data, and 98.1% accuracy on
ESC-50. The code and pre-trained models are publicly available.
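
As a rough illustration of the first iteration described above, the following sketch shows how a random-projection tokenizer could turn continuous patch features into discrete labels, and how a mask selects the positions whose labels a model would be asked to predict. It is not the authors' implementation: the function names, feature dimensions, codebook size, and 75% mask ratio are assumptions chosen for illustration, and random Gaussian vectors stand in for real audio patch features.

```python
import numpy as np


def random_projection_tokenizer(patches, proj, codebook):
    """Map each patch feature to the index of its nearest codebook vector.

    Both `proj` and `codebook` are randomly initialized and kept frozen
    (hypothetical setup for illustration).
    """
    projected = patches @ proj  # (num_patches, code_dim)
    # Squared Euclidean distance from every projected patch to every code.
    dists = ((projected[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # (num_patches,) integer labels


def make_mask(num_patches, mask_ratio, rng):
    """Return a boolean mask with a fixed fraction of positions set to True."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
num_patches, patch_dim, code_dim, codebook_size = 32, 256, 16, 64

patches = rng.standard_normal((num_patches, patch_dim))    # stand-in features
proj = rng.standard_normal((patch_dim, code_dim))          # frozen projection
codebook = rng.standard_normal((codebook_size, code_dim))  # frozen codebook

labels = random_projection_tokenizer(patches, proj, codebook)
mask = make_mask(num_patches, mask_ratio=0.75, rng=rng)

# Prediction targets: the discrete labels at the masked positions only.
targets = labels[mask]
print(labels.shape, int(mask.sum()), targets.shape)
```

In this sketch, an SSL model would be trained to predict `targets` from the unmasked patches. In later iterations, the random tokenizer would be replaced by one distilled from the trained model, as the abstract describes.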