Sound demos for "Non-Autoregressive Neural Text-to-Speech"

Authors: Kainan Peng*, Wei Ping*, Zhao Song*, Kexin Zhao*

* Equal contribution

Paper: arXiv. Published at ICML 2020.

Section I: Speech quality using various neural vocoders

We synthesize speech from Deep Voice 3 and ParaNet using various neural vocoders, including WaveNet, a distilled IAF vocoder, WaveVAE, and WaveGlow. We also include synthesized samples from a FastSpeech model (based on a reimplementation) using the WaveGlow vocoder. For reference, we include some ground-truth audio clips from our proprietary training dataset. Note that each ground-truth clip contains different content from the text displayed immediately above it.

Deep Voice 3 + WaveNet | ParaNet + WaveNet | Ground truth (reference only)
1: Ask her to bring these things with her from the store.
2: We also need a small plastic snake and a big toy frog for the kids.
3: The rainbow is a division of white light into many beautiful colors.
4: People look but no one ever finds it.
5: Throughout the centuries people have explained the rainbow in various ways.

Deep Voice 3 + IAF (distilled) | ParaNet + IAF (distilled)
1: Ask her to bring these things with her from the store.
2: We also need a small plastic snake and a big toy frog for the kids.
3: The rainbow is a division of white light into many beautiful colors.
4: People look but no one ever finds it.
5: Throughout the centuries people have explained the rainbow in various ways.

Deep Voice 3 + IAF (WaveVAE) | ParaNet + IAF (WaveVAE)
1: Ask her to bring these things with her from the store.
2: We also need a small plastic snake and a big toy frog for the kids.
3: The rainbow is a division of white light into many beautiful colors.
4: People look but no one ever finds it.
5: Throughout the centuries people have explained the rainbow in various ways.

Deep Voice 3 + WaveGlow | ParaNet + WaveGlow | FastSpeech + WaveGlow
1: Ask her to bring these things with her from the store.
2: We also need a small plastic snake and a big toy frog for the kids.
3: The rainbow is a division of white light into many beautiful colors.
4: People look but no one ever finds it.
5: Throughout the centuries people have explained the rainbow in various ways.

Section II: Controlling the rate of speech

The non-autoregressive ParaNet can synthesize speech at different rates by specifying the position encoding rate and the corresponding length of the output spectrogram. The following synthesized examples use slow, normal, and fast speech rates, respectively. We use WaveNet as the vocoder.

Slow | Normal | Fast
1: Six spoons of fresh snow peas five thick slabs of blue cheese and maybe a snack for her brother Bob.
2: We also need a small plastic snake and a big toy frog for the kids.
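
The rate control described in this section can be illustrated with a rough sketch. It is not the authors' code. It assumes a standard sinusoidal position encoding whose phase is scaled by a rate, and it assumes the Deep Voice 3-style convention that the encoder position rate equals the number of output spectrogram frames per input character. The function names, the `speed` factor, and the default of 6.0 frames per character are hypothetical and chosen only for illustration.

```python
import numpy as np


def position_encoding(length, dim, rate=1.0):
    """Sinusoidal position encoding whose phase advances by `rate` per step.

    Assumption: the standard sine/cosine layout (even channels sine,
    odd channels cosine); the model's exact layout may differ.
    """
    positions = rate * np.arange(length)[:, None]        # shape (length, 1)
    channels = np.arange(dim)[None, :]                   # shape (1, dim)
    angles = positions / np.power(10000.0, 2 * (channels // 2) / dim)
    enc = np.empty((length, dim))
    enc[:, 0::2] = np.sin(angles[:, 0::2])
    enc[:, 1::2] = np.cos(angles[:, 1::2])
    return enc


def rate_settings(num_chars, speed, base_frames_per_char=6.0):
    """Return (encoder position rate, output spectrogram length).

    Hypothetical helper: speed > 1 means faster speech, so fewer
    output frames are allotted to each input character.
    """
    frames_per_char = base_frames_per_char / speed
    out_len = int(round(num_chars * frames_per_char))
    return frames_per_char, out_len


if __name__ == "__main__":
    text = "We also need a small plastic snake and a big toy frog for the kids."
    for label, speed in [("slow", 0.8), ("normal", 1.0), ("fast", 1.25)]:
        rate, out_len = rate_settings(len(text), speed)
        enc = position_encoding(len(text), 64, rate=rate)
        print(label, "rate:", rate, "frames:", out_len, "encoding:", enc.shape)
```

Under these assumptions, the position rate and the output length move together: slowing the speech raises both the frames per character and the total number of frames, so the text positions and the frame positions stay aligned.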