Improving the Quality of End-to-End Speech Recognition
A Better and Faster End-to-End Model for Streaming ASR
We explore the tradeoff between quality and latency for streaming end-to-end (E2E) speech recognition, measured by word error rate and endpointer latency, respectively.
We explore encouraging the end-to-end model to emit words early, through an algorithm called FastEmit, and also explore running a 2nd-pass beam search to improve quality.
In addition, we explore non-causal Conformer layers that feed into the same 1st-pass RNN-T decoder, an algorithm called Cascaded Encoders.
Overall, we find that the combination of FastEmit and non-causal Conformer layers offers a better quality and latency tradeoff for streaming end-to-end speech recognition.
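
The data flow behind the cascaded-encoder idea can be illustrated with a toy sketch. This is not the authors' implementation: the moving-average "layers", the linear `shared_decoder`, and all function names and context sizes below are hypothetical stand-ins, chosen only to show causal layers feeding non-causal layers while both outputs go through one shared decoder.

```python
import numpy as np

def causal_layer(x, left=2):
    """Toy causal layer: each frame averages itself and up to `left` past frames."""
    T = x.shape[0]
    out = np.empty_like(x)
    for t in range(T):
        out[t] = x[max(0, t - left): t + 1].mean(axis=0)
    return out

def non_causal_layer(x, left=2, right=2):
    """Toy non-causal layer: each frame also averages up to `right` future frames."""
    T = x.shape[0]
    out = np.empty_like(x)
    for t in range(T):
        out[t] = x[max(0, t - left): min(T, t + right + 1)].mean(axis=0)
    return out

def shared_decoder(enc, W):
    """Stand-in for the shared decoder: per-frame linear projection plus argmax."""
    return (enc @ W).argmax(axis=-1)

def cascaded_encoders(features, W):
    """Run the causal encoder, then cascade its output into the non-causal one.

    Returns (streaming_labels, full_context_labels); both come from the same decoder.
    """
    causal_out = causal_layer(features)
    non_causal_out = non_causal_layer(causal_out)  # consumes causal output, not raw features
    return shared_decoder(causal_out, W), shared_decoder(non_causal_out, W)
```

In this sketch the streaming path never looks at future frames, so its output for a frame is available immediately, while the non-causal path waits for right context and trades latency for more information per frame.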