OptiSpeech is meant to be an ultra-efficient, lightweight and fast text-to-speech model for on-device use.
I would like to thank Pneuma Solutions for providing GPU resources for training this model. Their support significantly accelerated my development process.
Demo: OptiSpeech-ConvNeXtTTS-run1.mp4
Note that this is still a work in progress; final model design decisions are still being made.

To install from source:
$ git clone https://github.com/mush42/optispeech
$ cd optispeech
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip3 install --upgrade pip setuptools wheel
$ pip3 install -r requirements.txt
To synthesise speech from the command line:
$ python3 -m optispeech.infer --help
usage: infer.py [-h] [--d-factor D_FACTOR] [--p-factor P_FACTOR] [--e-factor E_FACTOR] [--cuda]
checkpoint text output_dir
Speaking text using OptiSpeech
positional arguments:
checkpoint Path to OptiSpeech checkpoint
text Text to synthesise
output_dir Directory to write generated audio to.
options:
-h, --help show this help message and exit
--d-factor D_FACTOR Scale to control speech rate
--p-factor P_FACTOR Scale to control pitch
--e-factor E_FACTOR Scale to control energy
--cuda Use GPU for inference
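For example, to synthesise a sentence from a trained checkpoint (the paths below are illustrative; assuming --d-factor scales predicted durations, values above 1.0 slow speech down):

$ python3 -m optispeech.infer /path/to/checkpoint.ckpt "Hello there!" ./outputs --d-factor 1.05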
You can also drive the model directly from Python:

import soundfile as sf
import torch

from optispeech.model import OptiSpeech

# Load model
device = torch.device("cpu")
ckpt_path = "/path/to/checkpoint"
model = OptiSpeech.load_from_checkpoint(ckpt_path, map_location="cpu")
model = model.to(device)
model = model.eval()
# Text preprocessing and phonemization
sentence = "A rainbow is a meteorological phenomenon that is caused by reflection, refraction and dispersion of light in water droplets resulting in a spectrum of light appearing in the sky."
x, x_lengths, clean_text = model.prepare_input(sentence)
# Inference
synth_outputs = model.synthesize(x, x_lengths)
wav = synth_outputs["wav"]
sf.write("output.wav", wav.squeeze().detach().cpu().numpy(), model.sample_rate)
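The same API scales to multiple sentences; here is a minimal sketch using only the calls shown above (the sentences and output file names are arbitrary):

import soundfile as sf

from optispeech.model import OptiSpeech

model = OptiSpeech.load_from_checkpoint("/path/to/checkpoint", map_location="cpu").eval()

sentences = [
    "The weather is lovely today.",
    "On-device synthesis keeps user data private.",
]
for i, sentence in enumerate(sentences):
    # prepare_input handles text preprocessing and phonemization
    x, x_lengths, clean_text = model.prepare_input(sentence)
    # synthesize returns a dict; "wav" holds the generated waveform
    wav = model.synthesize(x, x_lengths)["wav"]
    sf.write(f"output-{i}.wav", wav.squeeze().detach().cpu().numpy(), model.sample_rate)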
Since this codebase is built on Lightning-Hydra-Template, you get all of the features that come with it.
Training is as easy as 1, 2, 3:
Given a dataset that is organized as follows:
├── train
│   ├── metadata.csv
│   └── wav
│       ├── aud-00001-0003.wav
│       └── ...
└── val
    ├── metadata.csv
    └── wav
        ├── aud-00764.wav
        └── ...
Use the `preprocess_dataset` script to prepare the dataset for training:
$ python3 -m optispeech.tools.preprocess_dataset --help
usage: preprocess_dataset.py [-h] [--format {ljspeech}] dataset input_dir output_dir
positional arguments:
dataset dataset config relative to `configs/data/` (without the suffix)
input_dir original data directory
output_dir Output directory to write datafiles + train.txt and val.txt
options:
-h, --help show this help message and exit
--format {ljspeech} Dataset format.
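For example, assuming a data config named `hfc_female-en_us` exists under `configs/data/` (the directory paths are illustrative):

$ python3 -m optispeech.tools.preprocess_dataset --format ljspeech hfc_female-en_us /path/to/raw/dataset ./data/hfc_female-en_us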
If you are training on a new dataset, you must calculate and add **data_statistics** using the following script:
$ python3 -m optispeech.tools.generate_data_statistics --help
usage: generate_data_statistics.py [-h] [-b BATCH_SIZE] [-f] [-o OUTPUT_DIR] input_config
positional arguments:
input_config The name of the yaml config file under configs/data
options:
-h, --help show this help message and exit
-b BATCH_SIZE, --batch-size BATCH_SIZE
Can have increased batch size for faster computation
-f, --force force overwrite the file
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
Output directory to save the data statistics
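For example, using the same hypothetical data config:

$ python3 -m optispeech.tools.generate_data_statistics -b 32 hfc_female-en_us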
To start training, run the following command. Note that this training run uses the config from `hfc_female-en_us`. You can copy it, update it with your own config values, and pass the name of the custom config file (without extension) instead.
$ python3 -m optispeech.train experiment=hfc_female-en_us
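Since training is driven by Hydra, you can also override config values on the command line; for example (the exact keys depend on your config):

$ python3 -m optispeech.train experiment=hfc_female-en_us data.batch_size=16 trainer.max_epochs=1000

Once training is done, checkpoints can be exported to ONNX for deployment: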
$ python3 -m optispeech.onnx.export --help
usage: export.py [-h] [--opset OPSET] [--seed SEED] checkpoint_path output
Export OptiSpeech checkpoints to ONNX
positional arguments:
checkpoint_path Path to the model checkpoint
output Path to output `.onnx` file
options:
-h, --help show this help message and exit
--opset OPSET ONNX opset version to use (default 15)
--seed SEED Random seed
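For example (paths are illustrative):

$ python3 -m optispeech.onnx.export /path/to/checkpoint.ckpt ./model.onnx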
$ python3 -m optispeech.onnx.infer --help
usage: infer.py [-h] [--d-factor D_FACTOR] [--p-factor P_FACTOR] [--e-factor E_FACTOR] [--cuda]
onnx_path text output_dir
ONNX inference of OptiSpeech
positional arguments:
onnx_path Path to the exported OptiSpeech ONNX model
text Text to speak
output_dir Directory to write generated audio to.
options:
-h, --help show this help message and exit
--d-factor D_FACTOR Scale to control speech rate.
--p-factor P_FACTOR Scale to control pitch.
--e-factor E_FACTOR Scale to control energy.
--cuda Use GPU for inference
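For example, running the exported model on CPU (paths are illustrative):

$ python3 -m optispeech.onnx.infer ./model.onnx "Hello from ONNX!" ./outputs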
Repositories I would like to acknowledge:
- BetterFastspeech2: for the repo backbone
- LightSpeech: for the transformer backbone
- JETS: for the phoneme-mel alignment framework
- Vocos: for pioneering the use of ConvNeXt in TTS
- Piper-TTS: for leading the charge in on-device TTS, and for the great phonemizer
@inproceedings{luo2021lightspeech,
  title={LightSpeech: Lightweight and fast text to speech with neural architecture search},
  author={Luo, Renqian and Tan, Xu and Wang, Rui and Qin, Tao and Li, Jinzhu and Zhao, Sheng and Chen, Enhong and Liu, Tie-Yan},
  booktitle={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={5699--5703},
  year={2021},
  organization={IEEE}
}
@article{siuzdak2023vocos,
  title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  author={Siuzdak, Hubert},
  journal={arXiv preprint arXiv:2306.00814},
  year={2023}
}
@inproceedings{okamoto2024convnexttts,
  title={ConvNeXt-TTS and ConvNeXt-VC: ConvNeXt-based fast end-to-end sequence-to-sequence text-to-speech and voice conversion},
  author={Okamoto, Takuma and Ohtani, Yamato and Toda, Tomoki and Kawai, Hisashi},
  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={12456--12460},
  year={2024},
  doi={10.1109/ICASSP48485.2024.10446890}
}
Copyright (c) Musharraf Omer. MIT Licence. See LICENSE for more details.