Contrastive Language-Audio Pretraining for Multimodal Representation Learning
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
In this paper, we propose a pipeline of contrastive language-audio pretraining to develop an audio representation by combining audio data with natural language descriptions.
First, we release LAION-Audio-630K, a large collection of audio-text pairs gathered from different data sources.
Second, we construct a contrastive language-audio pretraining model by considering different audio encoders and text encoders.
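To make the training objective concrete, the following is a minimal sketch of the symmetric contrastive (CLIP-style) loss between paired audio and text embeddings; the function name, tensor shapes, and temperature handling are illustrative assumptions rather than the exact implementation released with the paper.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, logit_scale):
    """Symmetric contrastive loss over a batch of paired audio/text embeddings.

    audio_emb, text_emb: (batch, dim) outputs of the audio and text encoders
    after projection into a shared embedding space (placeholder shapes).
    logit_scale: learnable temperature, as in CLIP.
    """
    # L2-normalize so the dot product becomes a cosine similarity
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix scaled by the temperature
    logits = logit_scale * audio_emb @ text_emb.t()

    # Matching audio-text pairs lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (audio-to-text and text-to-audio)
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2
```

The same normalized embeddings can be reused at inference time for text-to-audio retrieval by ranking cosine similarities.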
We incorporate a feature fusion mechanism and keyword-to-caption augmentation into the model design, further enabling the model to process audio inputs of variable lengths and improving performance.
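At a high level, the feature fusion mechanism lets a fixed-input-size audio encoder consume variable-length audio by combining a globally downsampled view of a long clip with a few randomly cropped local views. The sketch below only constructs such a stack of views from a mel-spectrogram; the actual model fuses them with a learned attention-based fusion module, and all shapes and parameter names here are assumptions.

```python
import torch
import torch.nn.functional as F

def build_fusion_views(mel, target_frames=1024, num_crops=3):
    """Build one global (downsampled) view and several random local crops.

    mel: (n_mels, time) mel-spectrogram of arbitrary length.
    Long clips yield a (num_crops + 1, n_mels, target_frames) stack of views;
    short clips are simply padded. A learned fusion module would then combine
    the views into a single encoder input.
    """
    n_mels, time = mel.shape
    if time <= target_frames:
        # Short clip: pad to the target length, no fusion needed
        return F.pad(mel, (0, target_frames - time)).unsqueeze(0)

    # Global view: resize the whole clip to the target number of frames
    global_view = F.interpolate(
        mel.unsqueeze(0).unsqueeze(0),
        size=(n_mels, target_frames),
        mode="bilinear",
        align_corners=False,
    ).squeeze(0).squeeze(0)

    # Local views: random fixed-length crops from the original clip
    local_views = [
        mel[:, start:start + target_frames]
        for start in torch.randint(0, time - target_frames + 1, (num_crops,)).tolist()
    ]

    return torch.stack([global_view] + local_views)
```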
Third, we perform comprehensive experiments to evaluate our model across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification.
The results demonstrate that our model achieves superior performance in the text-to-audio retrieval task and state-of-the-art performance in audio classification tasks.
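For the zero-shot audio classification setting, a common recipe is to embed each class name in a short textual prompt and pick the class whose text embedding is most similar to the audio embedding. The snippet below is a simplified sketch of that procedure; the prompt wording and the text_encoder/audio embedding interfaces are assumptions, not the released model's API.

```python
import torch.nn.functional as F

def zero_shot_classify(audio_emb, class_names, text_encoder):
    """Pick the class whose prompted text embedding is closest to the clip.

    audio_emb: (dim,) embedding of one audio clip from the audio encoder.
    text_encoder: callable mapping a list of strings to (num_classes, dim)
    embeddings (hypothetical interface).
    """
    prompts = [f"This is a sound of {name}." for name in class_names]
    text_emb = F.normalize(text_encoder(prompts), dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)

    # Cosine similarity between the clip and every class prompt
    scores = text_emb @ audio_emb
    return class_names[int(scores.argmax())]
```

No task-specific training is involved: classification reduces to the same audio-text matching used for retrieval.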
Authors
Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, Shlomo Dubnov
The contrastive learning paradigm is a successful solution for training models on large-scale noisy data collected from the internet.
Audio is one of the most common information types in the world alongside text and image data.
As in the vision domain, audio and natural language contain overlapping information, and text descriptions can be learned jointly with the related audio to form a cross-modal audio representation.
However, different audio tasks typically require finely-annotated data, which limits the amount of available audio data due to the labor-intensive collection procedure.
Consequently, designing an effective audio representation that serves many audio tasks without requiring extensive supervision remains a challenge.
The recently proposed contrastive language-image pretraining (CLIP) model has shown great success in downstream tasks such as text-to-image retrieval and text-guided captioning.
This paper investigates the effectiveness of the learned representation in the downstream task of audio classification.
Results
In this paper, we propose a large-scale audio-text dataset and improvements to the current language-audio contrastive learning paradigm.
We show that LAION-Audio-630K, AudioSet with keyword-to-caption augmentation, and feature fusion lead to better audio understanding and task performance, and enable effective learning on variable-length audio data.
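As a concrete illustration of the keyword-to-caption augmentation mentioned above, here is a minimal template-based sketch that turns label sets (for example, AudioSet tags) into pseudo-captions usable as the text side of an audio-text pair; the paper generates captions with a pre-trained language model, so the templates and function name below are simplified stand-ins.

```python
import random

# Hypothetical caption templates; a pre-trained language model could be
# used instead to produce more natural captions from the same keywords.
TEMPLATES = [
    "The sound of {}.",
    "A recording of {}.",
    "{} can be heard.",
]

def keywords_to_caption(keywords):
    """Turn a list of class labels or tags into a pseudo natural-language caption."""
    phrase = ", ".join(k.lower() for k in keywords)
    return random.choice(TEMPLATES).format(phrase)

# Example: AudioSet-style labels become a caption paired with the audio clip
print(keywords_to_caption(["Dog", "Bark"]))  # e.g. "The sound of dog, bark."
```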