Text to speech (TTS), or speech synthesis, which aims to synthesize
intelligible and natural speech given text, is a hot research topic in speech,
language, and machine learning communities and has broad applications in the
industry. As the development of deep learning and artificial intelligence,
neural network-based TTS has significantly improved the quality of synthesized
speech in recent years. In this paper, we conduct a comprehensive survey on
neural TTS, aiming to provide a good understanding of current research and
future trends. We focus on the key components in neural TTS, including text
analysis, acoustic models and vocoders, and several advanced topics, including
fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS, etc.
We further summarize resources related to TTS (e.g., datasets, opensource
implementations) and discuss future research directions. This survey can serve
both academic researchers and industry practitioners working on TTS.