warisqr007 / voice-conversion

PPG-based Voice Conversion Using Zero-Shot Learning


Voice Conversion Using Zero-Shot Learning

This is a TensorFlow + PyTorch implementation of phonetic posteriorgram (PPG) based voice conversion, adapted from the Real-Time Voice Cloning implementation at https://github.com/CorentinJ/Real-Time-Voice-Cloning.

Installation

  • Python 3.8
  • Install PyTorch (>=1.0.1).
  • Install the NVIDIA build of TensorFlow 1.15.
  • Install ffmpeg.
  • Install Kaldi.
  • Install PyKaldi.
  • Run pip install -r requirements.txt to install the remaining packages.
  • Download the pretrained TDNN-F model, extract it, and set PRETRAIN_ROOT in kaldi_scripts/extract_features_kaldi.sh to the pretrained model directory.
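
Since the setup mixes two frameworks, a quick version check helps confirm the environment before running anything. This is a minimal sketch assuming only the packages installed above; it is not part of the repository:

```python
# Environment sanity check (not part of this repository): confirms the
# PyTorch / TensorFlow versions the README calls for are importable.
import torch
import tensorflow as tf

print("PyTorch:", torch.__version__)             # expected >= 1.0.1
print("CUDA available:", torch.cuda.is_available())
print("TensorFlow:", tf.__version__)             # expected 1.15.x (NVIDIA build)
```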

Dataset

  • Acoustic Model: LibriSpeech. Download the pretrained TDNN-F acoustic model here.
    • You also need to set KALDI_ROOT and PRETRAIN_ROOT in kaldi_scripts/extract_features_kaldi.sh accordingly.
  • Speaker Encoder: LibriSpeech; see here for the detailed training process.
  • Synthesizer (i.e., the seq2seq model): ARCTIC and L2-ARCTIC. Please see here for a merged version.
  • Vocoder: LibriSpeech; see here for the detailed training process.

All the pretrained models are available here.

Quick Start

See the inference script.
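
For orientation, below is a rough sketch of what such an inference pipeline looks like. The module names, checkpoint paths, and function signatures are assumptions modeled on the Real-Time-Voice-Cloning codebase this repository is adapted from (with BNFs standing in for text as the synthesizer input); consult the actual inference script for the exact API.

```python
# Hypothetical inference sketch; names follow the Real-Time-Voice-Cloning
# layout this repo adapts and may differ from the actual code here.
from pathlib import Path
import numpy as np
import soundfile as sf

from encoder import inference as encoder          # speaker encoder (assumed module)
from synthesizer.inference import Synthesizer     # seq2seq BNF->mel model (assumed)
from vocoder import inference as vocoder          # neural vocoder (assumed module)

encoder.load_model(Path("saved_models/encoder.pt"))         # assumed checkpoint paths
synthesizer = Synthesizer(Path("saved_models/synthesizer"))
vocoder.load_model(Path("saved_models/vocoder.pt"))

# 1. Speaker embedding for the target speaker (zero-shot: any reference utterance).
target_wav = encoder.preprocess_wav("target_speaker.wav")
embed = encoder.embed_utterance(target_wav)

# 2. Bottleneck features (PPGs/BNFs) for the source utterance, extracted by the
#    Kaldi TDNN-F model (see kaldi_scripts/extract_features_kaldi.sh).
bnf = np.load("source_utterance_bnf.npy")                   # assumed .npy export

# 3. BNFs + target embedding -> mel spectrogram -> waveform.
mel = synthesizer.synthesize_spectrograms([bnf], [embed])[0]
wav = vocoder.infer_waveform(mel)
sf.write("converted.wav", wav, synthesizer.sample_rate)
```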

Training

  • Use Kaldi to extract bottleneck features (BNFs) for the reference L1 speaker (a sketch for inspecting the output follows this list):
./kaldi_scripts/extract_features_kaldi.sh /path/to/L2-ARCTIC/BDL
  • Preprocess audio and compute speaker embeddings:
python synthesizer_preprocess_audio.py /path/to/L2-ARCTIC BDL /path/to/L2-ARCTIC/BDL/kaldi --out_dir=your_preprocess_output_dir
python synthesizer_preprocess_embeds.py your_preprocess_output_dir
  • Train the synthesizer:
python synthesizer_train.py Accetron_train your_preprocess_output_dir
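
The Kaldi step above writes the BNFs as Kaldi feature tables. Here is a minimal PyKaldi sketch for inspecting them; the feats.scp location is an assumption about where kaldi_scripts/extract_features_kaldi.sh places its output:

```python
# Inspect the Kaldi-extracted bottleneck features with PyKaldi.
# The scp path below is an assumption about where
# kaldi_scripts/extract_features_kaldi.sh writes its output.
from kaldi.util.table import SequentialMatrixReader

with SequentialMatrixReader("scp:/path/to/L2-ARCTIC/BDL/kaldi/feats.scp") as reader:
    for utt_id, feats in reader:
        print(utt_id, feats.num_rows, "frames x", feats.num_cols, "dims")
        bnf = feats.numpy()   # (frames, bnf_dim) float32 array
        break                 # inspect just the first utterance
```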

About

PPG-based Voice Conversion Using Zero-Shot Learning

License: Apache License 2.0


Languages

Python 97.9% · Jupyter Notebook 1.2% · Shell 0.9%