An emotional speech synthesis research project conducted as part of IS4152 coursework. This repository contains code to train a speech synthesis model that attempts to generate speech-like sounds expressing a chosen emotion.
- NVIDIA GPU with CUDA and cuDNN
- Clone this repo:
  ```
  git clone https://github.com/taneliang/mellotron.git
  ```
- `cd` into this repo:
  ```
  cd mellotron
  ```
- Initialize submodules:
  ```
  git submodule init; git submodule update
  ```
- Check CUDA toolkit version:
  ```
  nvcc --version
  ```
  NB: This is the toolkit version, which may be different from the version reported by `nvidia-smi`.
- Create a Python 3 virtual environment:
  ```
  python3 -m venv .env-cuda<CUDA version>
  ```
- Activate the venv by running one of the following:
  - bash/sh: `source .env-cudaxxx/bin/activate`
  - csh: `source .env-cudaxxx/bin/activate.csh`
  - fish: `source .env-cudaxxx/bin/activate.fish`
- Install PyTorch. At the time of writing, these are the instructions:
  - CUDA 10.0:
    ```
    pip install torch==1.4.0+cu100 torchvision==0.5.0+cu100 -f https://download.pytorch.org/whl/cu100/torch_stable.html
    ```
  - CUDA 10.1:
    ```
    pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
    ```
  - CUDA 10.2 or 11.0:
    ```
    pip install torch torchvision
    ```
- Install Apex:
  ```
  pushd ..
  git clone https://github.com/NVIDIA/apex
  cd apex
  pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
  popd
  ```
- Install Python requirements:
  ```
  pip install -r requirements.txt
  ```
- EmoV-DB:
  - Download the EmoV-DB dataset.
  - Normalize it:
    ```
    ls */*/*.wav | xargs -I % sh -c 'mkdir -p ../out/$(dirname %) && sox % --rate 16000 -c 1 -b 16 ../out/%'
    ```
  - Trim leading and trailing silences:
    ```
    ls */*/*.wav | xargs -I @ sh -c 'mkdir -p ../out-no-silence/$(dirname @) && sox @ --rate 16000 -c 1 -b 16 ../out-no-silence/@ silence 1 0.1 1% reverse silence 1 0.1 1% reverse'
    ```
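After these conversions it can be worth spot-checking that the output files really are 16 kHz, mono, 16-bit, since a mistyped `sox` flag fails silently from the pipeline's point of view. A minimal sketch using Python's standard `wave` module (the helper name is ours, not part of this repo):

```python
import wave

def check_wav(path):
    """Return True if the WAV file at `path` is 16 kHz, mono, 16-bit PCM."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)  # 2 bytes per sample = 16-bit
```

Run it over a few files from `../out` and `../out-no-silence` before committing to a long training run.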
- (Optional) Manually trim non-verbal expressions:
  - Generate a CSV file to be manually filled in with trim timestamps:
    ```
    ./genmanualtrimlist.py
    ```
  - Use the CSV file to trim files:
    ```
    ./createcleanemovdb.py
    ```
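The idea behind the trim step can be sketched as below, assuming a CSV whose rows hold a relative path plus start and end timestamps in seconds. The column layout here is an assumption for illustration; the actual format is whatever `genmanualtrimlist.py` emits, so treat this as a sketch rather than a replacement for `createcleanemovdb.py`:

```python
import csv
import subprocess

def trim_from_csv(csv_path, out_dir):
    """Trim each WAV with sox, given rows of (path, start_sec, end_sec).

    Assumed CSV layout: path,start,end -- illustrative only; see
    genmanualtrimlist.py for the real format.
    """
    with open(csv_path, newline="") as f:
        for path, start, end in csv.reader(f):
            out = f"{out_dir}/{path}"
            # sox `trim <start> =<end>` keeps audio between the two
            # absolute timestamps
            subprocess.run(["sox", path, out, "trim", start, f"={end}"],
                           check=True)
```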
- LJSpeech:
- Download the LJSpeech dataset.
  - Normalize it:
    ```
    mkdir ../../LJSpeech-1.1/wavs && ls *.wav | xargs -I % sh -c 'sox % --rate 16000 -c 1 -b 16 ../../LJSpeech-1.1/wavs/%'
    ```
- Generate filelist files:
  ```
  cd scripts
  vim ./genfilelist.py # Configure the script before running
  ./genfilelist.py
  cd ..
  ```
- Update the filelists inside the filelists folder to point to your data.
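Before training, it may help to sanity-check the edited filelists. The sketch below assumes the Mellotron-style pipe-separated layout (`audio_path|text|speaker_id`); adjust the expected field count if your filelists carry different columns:

```python
import os

def validate_filelist(path):
    """Return 1-based line numbers of malformed or dangling entries.

    Assumes pipe-separated rows of the form: audio_path|text|speaker_id.
    A row is flagged if it does not have exactly three fields or if its
    audio file does not exist on disk.
    """
    bad = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            fields = line.rstrip("\n").split("|")
            if len(fields) != 3 or not os.path.isfile(fields[0]):
                bad.append(n)
    return bad
```

An empty return value means every referenced audio file exists and every row parses.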
- Train the model:
  ```
  python train.py --output_directory=outdir --log_directory=logdir
  ```
- (Optional) Monitor training progress with TensorBoard:
  ```
  tensorboard --logdir=outdir/logdir
  ```
Training using a pre-trained model can lead to faster convergence. By default, the emotion embedding layer is ignored:
```
python train.py --output_directory=outdir --log_directory=logdir -c models/mellotron_libritts.pt --warm_start
```
For multi-GPU (distributed) and mixed-precision training:
```
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
```
- Start a Jupyter notebook server:
  ```
  jupyter notebook --ip=127.0.0.1 --port=31337
  ```
- Load `inference.ipynb`
- (Optional) Download our published WaveGlow model.
WaveGlow is a faster-than-real-time flow-based generative network for speech synthesis.
This project is a slight modification of Mellotron, developed by Rafael Valle, Jason Li, Ryan Prenger, and Bryan Catanzaro.
In turn, Mellotron uses code from repositories by Keith Ito, Prem Seetharaman, Chengqi Deng, and Patrice Guyot, as described in our code.