EfficientSpeech: An On-Device Text to Speech Model

EfficientSpeech, or ES for short, is an efficient neural text to speech (TTS) model. It generates mel spectrograms at a speed of 104 (mRTF), that is, 104 secs of speech per sec, on an RPi4. Its tiny version has a footprint of just 266k parameters. Generating 6 secs of speech consumes only 90 MFLOPS.
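As a quick worked example of what the speed figure implies, the sketch below treats mRTF as seconds of generated speech divided by seconds of compute, which is how the paragraph above describes it. The function names are illustrative and not part of the repository.

```python
def synthesis_time(speech_seconds: float, mrtf: float) -> float:
    """Seconds of compute needed to generate `speech_seconds` of speech."""
    if mrtf <= 0:
        raise ValueError("mrtf must be positive")
    return speech_seconds / mrtf


def mrtf_from_timing(speech_seconds: float, compute_seconds: float) -> float:
    """Speed factor measured from one timed run."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return speech_seconds / compute_seconds


# Figures quoted above: a speed of 104 (mRTF) on an RPi4, 6 secs of speech.
elapsed = synthesis_time(6.0, 104.0)
print(f"{elapsed * 1000:.1f} ms")  # prints "57.7 ms"
```

The result covers mel spectrogram generation only, since that is the quantity the speed figure refers to.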

Paper

IEEE Xplore

Model Architecture

EfficientSpeech is a shallow (2 blocks!) pyramid transformer resembling a U-Net. Upsampling is done by a transposed depth-wise separable convolution.
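To illustrate why a depth-wise separable convolution suits a small model, the sketch below counts the weights of a standard 1D convolution and of a depth-wise separable one. The 1D setting and the channel and kernel sizes are assumptions chosen for illustration; they are not taken from the paper or the repository. The count is the same for the transposed variant, because transposition changes how the weights are applied, not how many there are.

```python
def standard_conv1d_params(c_in: int, c_out: int, kernel_size: int) -> int:
    """Weights of a dense 1D convolution (bias omitted)."""
    return c_in * c_out * kernel_size


def depthwise_separable_conv1d_params(c_in: int, c_out: int, kernel_size: int) -> int:
    """Weights of a depth-wise conv followed by a 1x1 point-wise conv (bias omitted)."""
    depthwise = c_in * kernel_size  # one filter per input channel
    pointwise = c_in * c_out        # 1x1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical sizes, for illustration only.
c_in, c_out, k = 128, 64, 3
print(standard_conv1d_params(c_in, c_out, k))             # prints 24576
print(depthwise_separable_conv1d_params(c_in, c_out, k))  # prints 8576
```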

Quick Demo

Install

pip install -r requirements.txt

If you encounter problems with cuBLAS:

pip uninstall nvidia_cublas_cu11

Tiny ES

python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/icassp2023/tiny_eng_266k.ckpt \
  --infer-device cpu --text "the quick brown fox jumps over the lazy dog" --wav-filename fox.wav

The output file is written under wav_outputs. Play the wav file:

ffplay wav_outputs/fox.wav-1.wav
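If ffplay is not available, the generated file can also be checked from Python. This is a minimal sketch using only the standard library; it assumes the demo writes uncompressed PCM wav data, which is the format the `wave` module reads. `wav_duration_seconds` is a hypothetical helper, not part of the repository.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM wav file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


# Example call, using the output path shown above:
# print(wav_duration_seconds("wav_outputs/fox.wav-1.wav"))
```

If `wave.open` raises `wave.Error`, the file is probably not plain PCM and a different reader is needed.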

After the weights have been downloaded, the local checkpoint file can be reused:

python3 demo.py --checkpoint tiny_eng_266k.ckpt --infer-device cpu  \
  --text "In additive color mixing, which is used for displays such as computer screens and televisions, the primary colors are red, green, and blue." \
  --wav-filename color.wav
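The choice between the release URL and the local file can be scripted. This is a minimal sketch, assuming (as the command above suggests) that a downloaded checkpoint ends up in the working directory under the file name taken from the URL. `checkpoint_argument` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path
from urllib.parse import urlparse


def checkpoint_argument(url: str, directory: str = ".") -> str:
    """Return the local checkpoint path if it exists, otherwise the URL."""
    file_name = Path(urlparse(url).path).name  # e.g. tiny_eng_266k.ckpt
    local_path = Path(directory) / file_name
    return str(local_path) if local_path.is_file() else url


TINY_URL = (
    "https://github.com/roatienza/efficientspeech/releases/download/"
    "icassp2023/tiny_eng_266k.ckpt"
)
# Value to pass to --checkpoint: local file if present, URL otherwise.
print(checkpoint_argument(TINY_URL))
```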

Playback:

ffplay wav_outputs/color.wav-1.wav

Small ES

python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/icassp2023/small_eng_952k.ckpt \
  --infer-device cpu  --n-blocks 3 --reduction 2  \
  --text "In subtractive color mixing, which is used for printing and painting, the primary colors are cyan, magenta, and yellow." \
  --wav-filename color-small.wav

Playback:

ffplay wav_outputs/color-small.wav-1.wav

Base ES

python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/icassp2023/base_eng_4M.ckpt \
  --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3 --infer-device cpu \
  --text "Why do bees have sticky hair?" --wav-filename bees-base.wav

Playback:

ffplay wav_outputs/bees-base.wav-1.wav

GPU for Inference

python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/icassp2023/small_eng_952k.ckpt \
  --infer-device cuda  --n-blocks 3 --reduction 2  \
  --text "In subtractive color mixing, which is used for printing and painting, the primary colors are cyan, magenta, and yellow."   \
  --wav-filename color-small.wav
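The value for --infer-device can be chosen automatically. This is a minimal sketch, assuming PyTorch is the backend installed by requirements.txt; it falls back to cpu when PyTorch or a CUDA device is missing. `pick_infer_device` is a hypothetical helper, not part of the repository.

```python
def pick_infer_device() -> str:
    """Return "cuda" when a CUDA device is usable, otherwise "cpu"."""
    try:
        import torch  # assumed backend; absent installs fall back to cpu
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


# Value to pass to --infer-device.
print(pick_infer_device())
```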

Train

Data Preparation

Use the unofficial FastSpeech2 implementation to prepare the dataset.

Tiny ES

python3 train.py

Small ES

python3 train.py --n-blocks 3 --reduction 2

Base ES

python3 train.py --head 2 --reduction 1 --expansion 2 \
  --kernel-size 5 --n-blocks 3 --block-depth 3
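The three variants differ only in the flags listed above, so the commands can be generated from one table. This is a minimal sketch; the flag values are copied from the commands in this README, while `VARIANT_FLAGS` and `build_command` are hypothetical names, not part of the repository.

```python
# Flags per variant, as listed in the commands above.
VARIANT_FLAGS = {
    "tiny": [],
    "small": ["--n-blocks", "3", "--reduction", "2"],
    "base": [
        "--head", "2", "--reduction", "1", "--expansion", "2",
        "--kernel-size", "5", "--n-blocks", "3", "--block-depth", "3",
    ],
}


def build_command(script: str, variant: str, extra_args=()) -> list:
    """Build the argument list for train.py or demo.py for one variant."""
    if variant not in VARIANT_FLAGS:
        raise ValueError(f"unknown variant: {variant!r}")
    return ["python3", script, *VARIANT_FLAGS[variant], *extra_args]


print(" ".join(build_command("train.py", "small")))
# prints "python3 train.py --n-blocks 3 --reduction 2"
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.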

Comparison with other SOTA Neural TTS

ES vs FS2 vs PortaSpeech vs LightSpeech

Credits

FastSpeech2 Unofficial Github

Citation

If you find this work useful, please cite:

@inproceedings{atienza2023efficientspeech,
  title={EfficientSpeech: An On-Device Text to Speech Model},
  author={Atienza, Rowel},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
