Guided-TTS: Text-to-Speech with Untranscribed Speech
Most neural text-to-speech (TTS) models require <speech, transcript> paired
data from the desired speaker for high-quality speech synthesis, which limits
the use of large amounts of untranscribed data for training. In this work, we
present Guided-TTS, a high-quality TTS model that learns to generate speech
from untranscribed speech data. Guided-TTS combines an unconditional diffusion
probabilistic model with a separately trained phoneme classifier for
text-to-speech. By modeling the unconditional distribution for speech, our
model can utilize the untranscribed data for training. For text-to-speech
synthesis, we guide the generative process of the unconditional DDPM via
phoneme classification to produce mel-spectrograms from the conditional
distribution given the transcript. We show that Guided-TTS achieves performance
comparable to existing methods without any transcript for LJSpeech. Our results
further show that a single speaker-independent phoneme classifier trained on
large-scale multi-speaker data can guide unconditional DDPMs for various
speakers to perform TTS.
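
As context for the guidance step described above, the standard classifier-guidance formulation is a useful sketch; the exact weighting and classifier form used in the paper may differ, and the gradient scale $s$ here is illustrative:

\[
\nabla_{X_t} \log p(X_t \mid y) \;\approx\; \nabla_{X_t} \log p_\theta(X_t) \;+\; s \, \nabla_{X_t} \log p_\phi(y \mid X_t),
\]

where $X_t$ is the noisy mel-spectrogram at diffusion step $t$, $y$ is the phoneme sequence obtained from the transcript, $p_\theta$ is the unconditional DDPM trained on untranscribed speech, and $p_\phi$ is the separately trained phoneme classifier whose gradient steers sampling toward the conditional distribution.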