Hierarchical Embeddings for Audio Separation
Hyperbolic Audio Source Separation
We introduce a framework for audio source separation using embeddings on a hyperbolic manifold that compactly represent the hierarchical relationship between sound sources and time-frequency features. On a synthetic dataset containing mixtures of multiple people talking and musical instruments playing, our hyperbolic model performed comparably to a Euclidean baseline in terms of source-to-distortion ratio, with stronger performance at low embedding dimensions. Furthermore, we find that time-frequency regions containing multiple overlapping sources are embedded towards the center (i.e., the most uncertain region) of the hyperbolic space, and we can use this certainty estimate to efficiently trade off between artifact introduction and interference reduction when isolating individual sounds.
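The certainty estimate described above can be illustrated with a minimal sketch. It assumes the embeddings live in the unit Poincaré ball with curvature -1, where the hyperbolic distance from the origin is 2 artanh(||z||); the function names, the threshold parameter, and the hard gating rule are hypothetical choices for illustration and are not taken from the model itself.

```python
import numpy as np

def poincare_distance_to_origin(z, eps=1e-7):
    """Hyperbolic distance from the origin of the unit Poincare ball.

    Assumes curvature -1, so d(0, z) = 2 * artanh(||z||).
    z has shape (..., embedding_dim); one embedding per T-F bin.
    """
    norm = np.linalg.norm(z, axis=-1)
    # Keep the norm strictly inside the ball so arctanh stays finite.
    norm = np.clip(norm, 0.0, 1.0 - eps)
    return 2.0 * np.arctanh(norm)

def certainty_gated_mask(mask, z, threshold):
    """Zero out mask values for T-F bins embedded close to the origin.

    `threshold` is a hypothetical tuning knob: larger values discard more
    uncertain bins, while smaller values keep more of the original mask.
    """
    certainty = poincare_distance_to_origin(z)
    return np.where(certainty >= threshold, mask, 0.0)

# Two T-F bins: one at the origin (uncertain), one farther out (more certain).
embeddings = np.array([[0.0, 0.0], [0.5, 0.0]])
source_mask = np.array([0.8, 0.9])
gated = certainty_gated_mask(source_mask, embeddings, threshold=1.0)
```

In this sketch, raising the threshold removes more of the bins where sources overlap, which would be expected to lower interference at the cost of more artifacts; a soft weighting by certainty would be an equally plausible alternative to the hard gate used here.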
Illustration of hyperbolic source separation. Left plot demonstrates the process of taking a T-F bin from a mixture spectrogram