Illustration of hyperbolic source separation. Left plot demonstrates the process of taking a T-F bin from a mixture spectrogram
We introduce a framework for audio source separation using embeddings on a hyperbolic manifold that compactly represent the hierarchical relationship between sound sources and time-frequency features.
On a synthetic dataset containing mixtures of multiple people talking and musical instruments playing, our hyperbolic model performed comparably to a Euclidean baseline in terms of source-to-distortion ratio, with stronger performance at low embedding dimensions.
Furthermore, we find that time-frequency regions containing multiple overlapping sources are embedded towards the center (i.e., the most uncertain region) of the hyperbolic space, and we can use this certainty estimate to efficiently trade off between artifact introduction and interference reduction when isolating individual sounds.
Authors
Darius Petermann, Gordon Wichern, Aswin Subramanian, Jonathan Le Roux
Deep learning-based audio source separation algorithms commonly rely on applying a mask to a feature representation (e.g., a magnitude spectrogram or learned basis) of an audio mixture signal.
By inverting or decoding the masked feature representation, we obtain the isolated time-domain source signals.
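The mask-and-invert idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the model from this work: the "STFT" here uses non-overlapping rectangular frames so that inversion is exact, the masks are oracle ratio masks computed from the known sources, and the function names (stft_frames, istft_frames, separate) are hypothetical.

```python
import numpy as np

def stft_frames(x, n_fft):
    """Crude STFT: non-overlapping rectangular frames (illustration only)."""
    n_frames = len(x) // n_fft
    frames = x[: n_frames * n_fft].reshape(n_frames, n_fft)
    return np.fft.rfft(frames, axis=1)  # shape (n_frames, n_fft // 2 + 1)

def istft_frames(spec, n_fft):
    """Inverse of stft_frames: per-frame inverse FFT, then concatenation."""
    return np.fft.irfft(spec, n=n_fft, axis=1).reshape(-1)

def separate(mixture, masks, n_fft=256):
    """Apply one T-F mask per source to the mixture spectrogram and invert."""
    spec = stft_frames(mixture, n_fft)
    return [istft_frames(m * spec, n_fft) for m in masks]

# Toy mixture: two sinusoids placed exactly on FFT bins 10 and 40.
n_fft, sr = 256, 8000
t = np.arange(sr) / sr
s1 = np.sin(2 * np.pi * (10 * sr / n_fft) * t)
s2 = np.sin(2 * np.pi * (40 * sr / n_fft) * t)
mix = s1 + s2

# Oracle ratio masks (a trained network would predict these instead).
mag1 = np.abs(stft_frames(s1, n_fft))
mag2 = np.abs(stft_frames(s2, n_fft))
mask1 = mag1 / (mag1 + mag2 + 1e-8)
mask2 = 1.0 - mask1

est1, est2 = separate(mix, [mask1, mask2], n_fft)
```

Because the two masks sum to one in every T-F bin, the separated signals add back up to the (truncated) mixture. A practical system would use overlapping windowed frames rather than this simplified framing.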
While techniques that learn feature encoders and decoders directly from waveform signals have achieved impressive performance, they lack interpretability compared to techniques based on time-frequency (T-F) representations such as the short-time Fourier transform (STFT) spectrogram.
In this work, we take inspiration from recent advances in modeling language, graphs, and images in hyperbolic space, and explore their relevance for audio source separation.
Result
We investigate the use of the Poincaré ball model to perform audio source separation in hyperbolic space.
Our hyperbolic model computes time-frequency (T-F) embeddings in Euclidean space and projects them into hyperbolic space.
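One common way to carry a Euclidean embedding into the Poincaré ball is the exponential map at the origin, exp_0(v) = tanh(sqrt(c) ||v||) v / (sqrt(c) ||v||), where c > 0 is the curvature parameter. The sketch below assumes this map is the projection step; the text above does not specify it, and the function name expmap0, the curvature c = 1, and the clipping margin are illustrative choices.

```python
import numpy as np

def expmap0(v, c=1.0, eps=1e-8):
    """Exponential map at the origin of the Poincare ball with curvature -c.

    v: array of shape (..., D) holding Euclidean (tangent-space) embeddings.
    Returns points with norm strictly below 1 / sqrt(c).
    """
    sqrt_c = np.sqrt(c)
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    x = np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)
    # Assumption: clip slightly inside the boundary, since tanh saturates
    # to exactly 1.0 in floating point for large inputs.
    max_norm = (1.0 - 1e-5) / sqrt_c
    x_norm = np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)
    return np.where(x_norm > max_norm, x / x_norm * max_norm, x)

# Hypothetical Euclidean embeddings for four T-F bins (D = 2).
v_demo = np.array([[0.0, 0.0], [3.0, 4.0], [0.1, 0.0], [100.0, 0.0]])
x_demo = expmap0(v_demo)
```

The map preserves direction and squashes the norm, so small Euclidean embeddings stay near the center of the ball while large ones approach its boundary.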
Masks are obtained by hyperbolic multinomial logistic regression considering the distance from hyperbolic embeddings to hyperbolic hyperplanes.
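As a rough sketch of how such masks could be computed, the code below follows the hyperbolic MLR formulation commonly used in the hyperbolic neural networks literature, in which each class k has a hyperplane defined by an offset p_k inside the ball and a normal vector a_k, and the logit is a signed, scaled distance to that hyperplane. Whether this matches the exact parameterization used in this work is an assumption, and all names (mobius_add, hyperbolic_mlr_masks, p, a, c) are illustrative.

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition on the Poincare ball (broadcasts over leading axes)."""
    xy = np.sum(x * y, axis=-1, keepdims=True)
    x2 = np.sum(x * x, axis=-1, keepdims=True)
    y2 = np.sum(y * y, axis=-1, keepdims=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c**2 * x2 * y2
    return num / den

def hyperbolic_mlr_masks(x, p, a, c=1.0):
    """Softmax masks from signed distances to K hyperbolic hyperplanes.

    x: (N, D) embeddings in the ball, one per T-F bin.
    p: (K, D) hyperplane offsets in the ball; a: (K, D) hyperplane normals.
    Returns (N, K) masks that sum to one over the K sources.
    """
    sqrt_c = np.sqrt(c)
    z = mobius_add(-p[None, :, :], x[:, None, :], c)     # (N, K, D)
    z2 = np.sum(z * z, axis=-1)                          # (N, K)
    za = np.sum(z * a[None, :, :], axis=-1)              # (N, K)
    a_norm = np.linalg.norm(a, axis=-1)                  # (K,)
    lam = 2.0 / (1.0 - c * np.sum(p * p, axis=-1))       # conformal factor
    logits = (lam * a_norm / sqrt_c) * np.arcsinh(
        2 * sqrt_c * za / ((1 - c * z2) * a_norm)
    )
    logits = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)

# Two sources with opposing hyperplanes through the origin (toy values).
p_demo = np.zeros((2, 2))
a_demo = np.array([[1.0, 0.0], [-1.0, 0.0]])
x_pts = np.array([[0.0, 0.0], [0.9, 0.0], [-0.9, 0.0]])
masks_demo = hyperbolic_mlr_masks(x_pts, p_demo, a_demo)
```

In this toy setup, an embedding at the origin receives equal mask values for both sources, while embeddings near the boundary are assigned almost entirely to one source.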
We relate the resulting certainty estimates to known audio concepts such as artifacts and interference.
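To illustrate how a certainty estimate might be used, the sketch below measures each embedding's hyperbolic distance from the origin, d(0, x) = (2 / sqrt(c)) artanh(sqrt(c) ||x||), and zeroes the mask in T-F bins that fall below a threshold. The hard gating rule, the threshold value, and the function names are assumptions made for illustration; they are not taken from this work.

```python
import numpy as np

def dist_from_origin(x, c=1.0):
    """Hyperbolic distance from the ball's origin; used here as certainty."""
    r = np.clip(np.sqrt(c) * np.linalg.norm(x, axis=-1), 0.0, 1.0 - 1e-7)
    return 2.0 / np.sqrt(c) * np.arctanh(r)

def certainty_gated_mask(mask, emb, threshold, c=1.0):
    """Keep mask values only in bins whose embedding is far from the center.

    mask: (N, K) per-source mask values; emb: (N, D) hyperbolic embeddings.
    """
    certain = dist_from_origin(emb, c) >= threshold
    return mask * certain[:, None]

# Two toy T-F bins: one near the center (uncertain), one near the boundary.
emb_demo = np.array([[0.05, 0.0], [0.8, 0.0]])
mask_demo = np.array([[0.55, 0.45], [0.98, 0.02]])
gated = certainty_gated_mask(mask_demo, emb_demo, threshold=1.0)
```

Under this rule, raising the threshold discards more of the ambiguous bins, which would be expected to reduce interference from other sources at the cost of introducing more artifacts; lowering it does the opposite.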