dGSLM: Generating Audio Samples of Natural Conversational Dialogues
Generative Spoken Dialogue Language Modeling
We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues.
It uses recent work on unsupervised spoken unit discovery, coupled with a dual-tower transformer architecture with cross-attention, trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels.
It is able to generate speech, laughter, and other paralinguistic signals in the two channels simultaneously, and it reproduces naturalistic turn-taking.
Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, Emmanuel Dupoux
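
The dual-tower idea in the abstract, where each speaker channel has its own tower and the towers exchange information through cross-attention, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single attention head, no learned projections, a residual connection, and a causal mask so that a frame only sees the other channel up to the same time step. The function names (`cross_attend`, `dual_tower_step`) and the toy frame vectors are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, other_channel):
    """Single-head scaled dot-product attention from one channel to the other.

    Assumption: frame t of the querying channel may only attend to frames
    0..t of the other channel (causal mask), and no learned query/key/value
    projections are applied.
    """
    dim = len(queries[0])
    contexts = []
    for t, query in enumerate(queries):
        visible = other_channel[: t + 1]  # causal mask (assumed)
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        contexts.append(
            [sum(w * value[d] for w, value in zip(weights, visible)) for d in range(dim)]
        )
    return contexts


def dual_tower_step(channel_a, channel_b):
    """One cross-attention exchange: each tower adds context from the other."""
    context_a = cross_attend(channel_a, channel_b)
    context_b = cross_attend(channel_b, channel_a)
    new_a = [[x + c for x, c in zip(f, ctx)] for f, ctx in zip(channel_a, context_a)]
    new_b = [[x + c for x, c in zip(f, ctx)] for f, ctx in zip(channel_b, context_b)]
    return new_a, new_b


if __name__ == "__main__":
    # Toy 2-dimensional frame embeddings for two speaker channels.
    speaker_a = [[1.0, 0.0], [0.0, 1.0]]
    speaker_b = [[0.5, 0.5], [1.0, 0.0]]
    updated_a, updated_b = dual_tower_step(speaker_a, speaker_b)
    print(updated_a)
    print(updated_b)
```

In a real model, each tower would also include self-attention, feed-forward layers, and learned projections, and the inputs would be embeddings of discrete speech units rather than hand-written vectors. The sketch only shows how the two channels can be updated simultaneously while each conditions on the other.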