MuLan: A Joint Embedding of Music Audio and Natural Language
We present MuLan, a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural-language descriptions.
The resulting audio-text representation subsumes existing ontologies while graduating to true zero-shot functionalities.
We demonstrate the versatility of the embeddings with a range of experiments including transfer learning, zero-shot music tagging, language understanding in the music domain, and cross-modal retrieval applications.
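Zero-shot tagging with a joint embedding can be sketched as follows: embed an audio clip and a set of free-text tags into the shared space, then rank tags by cosine similarity. This is a minimal illustration with toy vectors standing in for real model outputs; the function names and dimensions are hypothetical, not part of the MuLan release.

```python
import numpy as np

def cosine_similarity(query, candidates):
    # Cosine similarity between one query vector and a matrix of candidates.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query

def zero_shot_tag(audio_emb, tag_embs, tag_names, top_k=3):
    # Rank free-text tags against an audio clip in the shared embedding space.
    scores = cosine_similarity(audio_emb, tag_embs)
    order = np.argsort(scores)[::-1][:top_k]
    return [(tag_names[i], float(scores[i])) for i in order]

# Toy 4-dim "embeddings" standing in for real audio/text encoder outputs.
audio_emb = np.array([1.0, 0.2, 0.0, 0.1])
tag_names = ["upbeat jazz", "slow piano ballad", "heavy metal"]
tag_embs = np.array([
    [0.9, 0.3, 0.1, 0.0],   # nearest to the audio clip
    [0.0, 1.0, 0.2, 0.5],
    [-0.5, 0.1, 0.9, 0.2],
])
print(zero_shot_tag(audio_emb, tag_embs, tag_names, top_k=1))
```

Cross-modal retrieval follows the same pattern with the roles reversed: a text query embedding is scored against a corpus of audio embeddings.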
Authors
Qingqing Huang, Aren Jansen, Joonseok Lee, Ravi Ganti, Judith Yue Li, Daniel P. W. Ellis