An emotional speech synthesis research project conducted as part of IS4152 coursework. This repository contains code to train a speech synthesis model that attempts to generate speech-like sounds expressing a chosen emotion.
- NVIDIA GPU with CUDA and cuDNN
- Clone this repo:
  ```
  git clone https://github.com/taneliang/mellotron.git
  ```
- `cd` into this repo:
  ```
  cd mellotron
  ```
- Initialize submodules:
  ```
  git submodule init; git submodule update
  ```
- Check CUDA toolkit version:
  ```
  nvcc --version
  ```
  NB: This is the toolkit version, which may be different from the version reported by `nvidia-smi`.
- Create a Python 3 virtual environment:
  ```
  python3 -m venv .env-cuda<CUDA version>
  ```
- Activate the venv by running one of the following:
  - bash/sh: `source .env-cudaxxx/bin/activate`
  - csh: `source .env-cudaxxx/bin/activate.csh`
  - fish: `source .env-cudaxxx/bin/activate.fish`
- Install PyTorch. At the time of writing, these are the instructions:
  - CUDA 10.0:
    ```
    pip install torch==1.4.0+cu100 torchvision==0.5.0+cu100 -f https://download.pytorch.org/whl/cu100/torch_stable.html
    ```
  - CUDA 10.1:
    ```
    pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
    ```
  - CUDA 10.2 or 11.0:
    ```
    pip install torch torchvision
    ```
- Install Apex:
  ```
  pushd ..
  git clone https://github.com/NVIDIA/apex
  cd apex
  pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
  popd
  ```
- Install Python requirements:
  ```
  pip install -r requirements.txt
  ```
- EmoV-DB:
  - Download the EmoV-DB dataset.
  - Normalize it:
    ```
    ls */*/*.wav | xargs -I % sh -c 'mkdir -p ../out/$(dirname %) && sox % --rate 16000 -c 1 -b 16 ../out/%'
    ```
  - Trim leading and trailing silences:
    ```
    ls */*/*.wav | xargs -I @ sh -c 'mkdir -p ../out-no-silence/$(dirname @) && sox @ --rate 16000 -c 1 -b 16 ../out-no-silence/@ silence 1 0.1 1% reverse silence 1 0.1 1% reverse'
    ```
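After these conversions it can be worth spot-checking that the output files really are 16 kHz, mono, 16-bit, since a mistyped `sox` flag fails silently from the pipeline's point of view. A minimal sketch using Python's standard `wave` module (the helper name is ours, not part of this repo):

```python
import wave

def check_wav(path):
    """Return True if the WAV file at `path` is 16 kHz, mono, 16-bit PCM."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)  # 2 bytes per sample = 16-bit
```

Run it over a few files from `../out` and `../out-no-silence` before committing to a long training run.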
- (Optional) Manually trim non-verbal expressions:
  - Generate a CSV file to be manually filled in with trim timestamps:
    ```
    ./genmanualtrimlist.py
    ```
  - Use the CSV file to trim files:
    ```
    ./createcleanemovdb.py
    ```
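The idea behind the trim step can be sketched as below, assuming a CSV whose rows hold a relative path plus start and end timestamps in seconds. The column layout here is an assumption for illustration; the actual format is whatever `genmanualtrimlist.py` emits, so treat this as a sketch rather than a replacement for `createcleanemovdb.py`:

```python
import csv
import subprocess

def trim_from_csv(csv_path, out_dir):
    """Trim each WAV with sox, given rows of (path, start_sec, end_sec).

    Assumed CSV layout: path,start,end -- illustrative only; see
    genmanualtrimlist.py for the real format.
    """
    with open(csv_path, newline="") as f:
        for path, start, end in csv.reader(f):
            out = f"{out_dir}/{path}"
            # sox `trim <start> =<end>` keeps audio between the two
            # absolute timestamps
            subprocess.run(["sox", path, out, "trim", start, f"={end}"],
                           check=True)
```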
- LJSpeech:
- Download the LJSpeech dataset.
  - Normalize it:
    ```
    mkdir ../../LJSpeech-1.1/wavs && ls *.wav | xargs -I % sh -c 'sox % --rate 16000 -c 1 -b 16 ../../LJSpeech-1.1/wavs/%'
    ```
- Generate filelist files:
  ```
  cd scripts
  vim ./genfilelist.py # Configure the script before running
  ./genfilelist.py
  cd ..
  ```
- Update the filelists inside the filelists folder to point to your data.
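Before training, it may help to sanity-check the edited filelists. The sketch below assumes the Mellotron-style pipe-separated layout (`audio_path|text|speaker_id`); adjust the expected field count if your filelists carry different columns:

```python
import os

def validate_filelist(path):
    """Return 1-based line numbers of malformed or dangling entries.

    Assumes pipe-separated rows of the form: audio_path|text|speaker_id.
    A row is flagged if it does not have exactly three fields or if its
    audio file does not exist on disk.
    """
    bad = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            fields = line.rstrip("\n").split("|")
            if len(fields) != 3 or not os.path.isfile(fields[0]):
                bad.append(n)
    return bad
```

An empty return value means every referenced audio file exists and every row parses.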
- Train the model:
  ```
  python train.py --output_directory=outdir --log_directory=logdir
  ```
- (Optional) Monitor training progress with TensorBoard:
  ```
  tensorboard --logdir=outdir/logdir
  ```
Training using a pre-trained model can lead to faster convergence. By default, the emotion embedding layer is ignored:
```
python train.py --output_directory=outdir --log_directory=logdir -c models/mellotron_libritts.pt --warm_start
```
For multi-GPU (distributed) and mixed-precision training:
```
python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
```
- Start a Jupyter notebook server:
  ```
  jupyter notebook --ip=127.0.0.1 --port=31337
  ```
- Load `inference.ipynb`
- (Optional) Download our published WaveGlow model.
WaveGlow is a faster-than-real-time flow-based generative network for speech synthesis.
This project is a slight modification of Mellotron, developed by Rafael Valle, Jason Li, Ryan Prenger, and Bryan Catanzaro.
In turn, Mellotron uses code from repositories by Keith Ito, Prem Seetharaman, Chengqi Deng, and Patrice Guyot, as described in our code.