FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion

In this paper, we adopt the end-to-end framework of VITS for high-quality waveform reconstruction and propose strategies for extracting clean content information without text annotation. We disentangle content information by imposing an information bottleneck on WavLM features, and propose spectrogram-resize (SR) based data augmentation to improve the purity of the extracted content information.
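
As a rough illustration of the spectrogram-resize idea, the sketch below stretches or compresses each mel-spectrogram frame along the frequency axis by a ratio, then cuts or pads it back to the original number of bins. It is a minimal pure-Python sketch, not the repository's implementation: the function names resize_bins and sr_augment, the use of linear interpolation, and the scale of the padding noise are all assumptions made for illustration.

```python
import random


def resize_bins(frame, ratio):
    """Linearly resample one frame (list of mel-bin values) to round(len * ratio) bins."""
    n_in = len(frame)
    n_out = max(1, round(n_in * ratio))
    if n_out == 1:
        return [frame[0]]
    out = []
    for i in range(n_out):
        # Position of output bin i on the input bin axis.
        pos = i * (n_in - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append(frame[lo] * (1 - frac) + frame[hi] * frac)
    return out


def sr_augment(mel, ratio, rng=random):
    """Resize every frame along frequency, then restore the original bin count.

    mel is a list of frames; each frame lists bin values from low to high frequency.
    """
    n_mels = len(mel[0])
    augmented = []
    for frame in mel:
        resized = resize_bins(frame, ratio)
        if len(resized) >= n_mels:
            # Stretched: drop the bins that no longer fit at the top.
            resized = resized[:n_mels]
        else:
            # Compressed: fill the top with the highest bin plus small noise
            # (the noise scale 1e-3 is an arbitrary illustrative choice).
            n_pad = n_mels - len(resized)
            resized = resized + [resized[-1] + rng.gauss(0.0, 1e-3) for _ in range(n_pad)]
        augmented.append(resized)
    return augmented
```

Calling sr_augment(mel, 1.1) stretches the frequency axis and sr_augment(mel, 0.9) compresses it; in both cases the number of frames and the number of bins per frame stay the same, so the time alignment of the content is untouched.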

Visit our demo page for audio samples.

We also provide the pretrained models.

Figure: (a) Training, (b) Inference

Pre-requisites

Clone this repo: git clone https://github.com/OlaWod/FreeVC.git
Change into the repo directory: cd FreeVC
Install the Python requirements: pip install -r requirements.txt
Download WavLM-Large and put it under the directory 'wavlm/'
Download the VCTK dataset (only needed for training)
Download the HiFi-GAN model and put it under the directory 'hifigan/' (only needed for training with SR-based augmentation)
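
Before running anything, it can help to confirm that the downloaded models are where the scripts expect them. The helper below is a hypothetical convenience, not part of the repository: it only checks that the 'wavlm/' directory (and 'hifigan/' when SR-based augmentation is wanted) exists and is non-empty, and it makes no assumption about the file names inside.

```python
from pathlib import Path


def missing_prerequisites(repo_root=".", with_sr=False):
    """Return the expected model directories that are absent or empty."""
    required = ["wavlm"]
    if with_sr:
        required.append("hifigan")  # only needed for SR-based augmentation
    missing = []
    for name in required:
        path = Path(repo_root) / name
        # A directory that exists but holds no files counts as missing.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing
```

Running missing_prerequisites(with_sr=True) from the repo root returns an empty list once both downloads are in place.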

Inference Example

Download the pretrained checkpoints and run:

# inference with FreeVC
CUDA_VISIBLE_DEVICES=0 python convert.py --hpfile logs/freevc.json --ptfile checkpoints/freevc.pth --txtpath convert.txt --outdir outputs/freevc

# inference with FreeVC-s
CUDA_VISIBLE_DEVICES=0 python convert.py --hpfile logs/freevc-s.json --ptfile checkpoints/freevc-s.pth --txtpath convert.txt --outdir outputs/freevc-s
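
The two commands above differ only in the model name, so they can be generated from it. The following is a small sketch using only the flags and paths shown above; convert_command and run_all are hypothetical helper names, and dry_run defaults to True so that nothing is executed unless asked.

```python
import os
import subprocess


def convert_command(model, txtpath="convert.txt"):
    """Build the convert.py argument list for a model name such as 'freevc'."""
    return [
        "python", "convert.py",
        "--hpfile", f"logs/{model}.json",
        "--ptfile", f"checkpoints/{model}.pth",
        "--txtpath", txtpath,
        "--outdir", f"outputs/{model}",
    ]


def run_all(models=("freevc", "freevc-s"), gpu="0", dry_run=True):
    """Return one conversion command per model, running each unless dry_run is set."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    commands = [convert_command(m) for m in models]
    if not dry_run:
        for cmd in commands:
            # check=True stops at the first failing conversion.
            subprocess.run(cmd, env=env, check=True)
    return commands
```

Calling run_all(dry_run=False) from the repo root would run both conversions one after the other on GPU 0.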

Training Example

Preprocess

python downsample.py --in_dir </path/to/VCTK/wavs>
ln -s dataset/vctk-16k DUMMY

# run this if you want a different train-val-test split
python preprocess_flist.py

# run this if you want to use a pretrained speaker encoder
CUDA_VISIBLE_DEVICES=0 python preprocess_spk.py

# run this if you want to train without SR-based augmentation
CUDA_VISIBLE_DEVICES=0 python preprocess_ssl.py

# run these if you want to train with SR-based augmentation
CUDA_VISIBLE_DEVICES=1 python preprocess_sr.py --min 68 --max 72
CUDA_VISIBLE_DEVICES=1 python preprocess_sr.py --min 73 --max 76
CUDA_VISIBLE_DEVICES=2 python preprocess_sr.py --min 77 --max 80
CUDA_VISIBLE_DEVICES=2 python preprocess_sr.py --min 81 --max 84
CUDA_VISIBLE_DEVICES=3 python preprocess_sr.py --min 85 --max 88
CUDA_VISIBLE_DEVICES=3 python preprocess_sr.py --min 89 --max 92
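
The six SR preprocessing commands above cover the integers 68 through 92 in contiguous, non-overlapping --min/--max ranges spread over three GPUs. The sketch below reproduces that split so it can be adapted to a different number of jobs or devices. split_range and sr_commands are hypothetical helpers, not part of the repository, and what the bounds mean is left to preprocess_sr.py.

```python
def split_range(lo, hi, parts):
    """Split the inclusive integer range [lo, hi] into contiguous chunks.

    Any remainder is given to the first chunks, one extra value each.
    """
    total = hi - lo + 1
    base, extra = divmod(total, parts)
    chunks = []
    start = lo
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append((start, start + size - 1))
        start += size
    return chunks


def sr_commands(lo=68, hi=92, gpus=(1, 1, 2, 2, 3, 3)):
    """Build one preprocess_sr.py command per entry in gpus."""
    chunks = split_range(lo, hi, len(gpus))
    return [
        f"CUDA_VISIBLE_DEVICES={gpu} python preprocess_sr.py --min {a} --max {b}"
        for gpu, (a, b) in zip(gpus, chunks)
    ]


if __name__ == "__main__":
    for command in sr_commands():
        print(command)
```

With the default arguments this prints the same six commands listed above; passing a different gpus tuple, for example (0, 1), splits the same range into two jobs.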

Train

# train freevc
CUDA_VISIBLE_DEVICES=0 python train.py -c configs/freevc.json -m freevc

# train freevc-s
CUDA_VISIBLE_DEVICES=2 python train.py -c configs/freevc-s.json -m freevc-s
