dpss-exp3-VC-BNF

Voice Conversion Experiments for the THUHCSI Course: <Digital Processing of Speech Signals>

Set up environment

  1. Install sox from http://sox.sourceforge.net/

  2. Install ffmpeg from https://www.ffmpeg.org/download.html#build-linux, or via apt-get install ffmpeg

  3. Set up the Python environment:

python3 -m venv /path/to/new/virtual/environment
source /path/to/new/virtual/environment/bin/activate
pip3 install -r dpss-exp3-VC-BNF/requirement_torch18.txt  # or requirement_torch19.txt

Use requirement_torch18.txt for V100 with CUDA 11.2 and requirement_torch19.txt for A100 with CUDA 11.2, or set up your own environment depending on the GPU and CUDA version you have.
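The choice between the two requirements files could be automated from the GPU name. This is only an illustrative sketch (the helper name and the substring matching are assumptions, not part of the repo):

```python
def pick_requirements(gpu_name: str) -> str:
    """Choose the requirements file based on the GPU name.

    Mapping follows the note above: A100 -> requirement_torch19.txt,
    V100 (and anything else, as a default) -> requirement_torch18.txt.
    """
    if "A100" in gpu_name.upper():
        return "requirement_torch19.txt"
    return "requirement_torch18.txt"


if __name__ == "__main__":
    # e.g. feed in the name reported by: nvidia-smi --query-gpu=name --format=csv,noheader
    print(pick_requirements("Tesla V100-SXM2-32GB"))
```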

Data Preparation

  1. Download the bzn/mst-male/mst-female corpus from http://10.103.10.112:8080/sub_dataset.tar
  2. Download the pretrained ASR model from http://10.103.10.112:8080/pretrained_model/final.pt
  3. Move final.pt to ./pretrained_model/asr_model
  4. All of the files mentioned above are also available at https://cloud.tsinghua.edu.cn/d/0edf01d65a194ec9aceb/

If you get 'Could not find a version for torch==1.9.0+cu111', see https://jishuin.proginn.com/p/763bfbd5e54b. Running

pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

will solve the problem.

Extract the dataset, and organize your data directories as follows:

dataset/
├── mst-female
├── mst-male
└── bzn
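A quick way to confirm the extraction produced the layout above is a small check with the standard library. This helper is a convenience sketch, not part of the repo:

```python
from pathlib import Path

# Speaker directories expected after extracting sub_dataset.tar (see tree above).
EXPECTED_SPEAKERS = ("bzn", "mst-female", "mst-male")


def missing_speaker_dirs(dataset_root):
    """Return the expected speaker directories that are absent under dataset_root."""
    root = Path(dataset_root)
    return [s for s in EXPECTED_SPEAKERS if not (root / s).is_dir()]
```

An empty return value means the dataset directory is ready for feature extraction.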

Any-to-One Voice Conversion Model

Feature Extraction

CUDA_VISIBLE_DEVICES=0 python preprocess.py --data_dir /path/to/dataset/bzn --save_dir /path/to/save_data/bzn/

Your extracted features will be organized as follows:

bzn/
├── dev_meta.csv
├── f0s
│   ├── bzn_000001.npy
│   ├── ...
├── linears
│   ├── bzn_000001.npy
│   ├── ...
├── mels
│   ├── bzn_000001.npy
│   ├── ...
├── ppgs
│   ├── bzn_000001.npy
│   ├── ...
├── test_meta.csv
└── train_meta.csv
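Before training, it can be worth checking that every utterance got all four feature files (f0s, linears, mels, ppgs). The helper below is an illustrative sketch using only the directory layout shown above; it is not part of the repo:

```python
from pathlib import Path

# Feature subdirectories produced by preprocess.py (see tree above).
FEATURE_DIRS = ("f0s", "linears", "mels", "ppgs")


def incomplete_utterances(save_dir):
    """Return utterance IDs missing at least one of the four feature files."""
    save_dir = Path(save_dir)
    per_dir = [{p.stem for p in (save_dir / d).glob("*.npy")} for d in FEATURE_DIRS]
    all_ids = set().union(*per_dir)
    complete = set.intersection(*per_dir)
    return sorted(all_ids - complete)
```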

Train

If you have a GPU (a single typical GPU is enough, roughly 1 s/batch):

CUDA_VISIBLE_DEVICES=0 python train_to_one.py --model_dir ./exps/model_dir_to_bzn --test_dir ./exps/test_dir_to_bzn --data_dir /path/to/save_data/bzn/

If you have no GPU (roughly 5 s/batch):

python train_to_one.py --model_dir ./exps/model_dir_to_bzn --test_dir ./exps/test_dir_to_bzn --data_dir /path/to/save_data/bzn/

Inference

CUDA_VISIBLE_DEVICES=0 python inference_to_one.py --src_wav /path/to/source/xx.wav --ckpt ./exps/model_dir_to_bzn/bnf-vc-to-one-49.pt --save_dir ./test_dir/
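inference_to_one.py takes a single --src_wav; to convert a whole folder of wavs, a loop along these lines could work (the loop itself is an assumption; the script name and flags match the command above):

```python
import subprocess
from pathlib import Path


def convert_folder(src_dir, ckpt, save_dir, run=False):
    """Build one inference_to_one.py command per .wav in src_dir; run them if run=True."""
    cmds = []
    for wav in sorted(Path(src_dir).glob("*.wav")):
        cmds.append(["python", "inference_to_one.py",
                     "--src_wav", str(wav),
                     "--ckpt", ckpt,
                     "--save_dir", save_dir])
    if run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)
    return cmds
```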

Any-to-Many Voice Conversion Model

Feature Extraction

# In the any-to-many VC task, we use all 3 speakers above as the target speaker set.
CUDA_VISIBLE_DEVICES=0 python preprocess.py --data_dir /path/to/dataset/ --save_dir /path/to/save_data/exp3-data-all

Your extracted features will be organized as follows:

exp3-data-all/
├── dev_meta.csv
├── f0s
│   ├── bzn_000001.npy
│   ├── ...
├── linears
│   ├── bzn_000001.npy
│   ├── ...
├── mels
│   ├── bzn_000001.npy
│   ├── ...
├── ppgs
│   ├── bzn_000001.npy
│   ├── ...
├── test_meta.csv
└── train_meta.csv

Train

If you have a GPU (a single typical GPU is enough, roughly 1 s/batch):

CUDA_VISIBLE_DEVICES=0 python train_to_many.py --model_dir ./exps/model_dir_to_many --test_dir ./exps/test_dir_to_many --data_dir /path/to/save_data/exp3-data-all

If you have no GPU (roughly 5 s/batch):

python train_to_many.py --model_dir ./exps/model_dir_to_many --test_dir ./exps/test_dir_to_many --data_dir /path/to/save_data/exp3-data-all

Inference

# Here we use 'mst-male' as the target speaker; change the tgt_spk argument to any of bzn, mst-female, or mst-male.
CUDA_VISIBLE_DEVICES=0 python inference_to_many.py --src_wav /path/to/source/xx.wav --tgt_spk mst-male --ckpt ./exps/model_dir_to_many/bnf-vc-to-many-49.pt --save_dir ./test_dir/
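Since the any-to-many model supports all three target speakers, one natural experiment is converting each source wav to every speaker. The helper below only enumerates the (wav, speaker) jobs; it is an illustrative sketch, not part of the repo:

```python
from pathlib import Path

# Target speakers available in the any-to-many model (from the dataset above).
TARGET_SPEAKERS = ("bzn", "mst-female", "mst-male")


def conversion_jobs(src_dir):
    """Pair every source wav with every target speaker for inference_to_many.py."""
    wavs = sorted(Path(src_dir).glob("*.wav"))
    return [(str(w), spk) for w in wavs for spk in TARGET_SPEAKERS]
```

Each returned pair supplies the --src_wav and --tgt_spk arguments for one inference run.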

Assignment requirements

This project is a vanilla voice conversion system based on BNFs (bottleneck features extracted from a pretrained ASR model).
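At a high level, a BNF-based VC system chains three stages: a pretrained ASR encoder extracts largely speaker-independent bottleneck features, the trained conversion model maps them to the target speaker's acoustic features, and a vocoder reconstructs the waveform. The sketch below is purely conceptual; every function name is a placeholder, not the repo's actual API:

```python
def asr_encoder(wav_frames):
    # Pretrained ASR model -> bottleneck features (BNFs): linguistic content,
    # mostly stripped of source-speaker identity.
    return [f"bnf({f})" for f in wav_frames]


def conversion_model(bnfs, target_speaker):
    # Trained VC model maps BNFs to the target speaker's mel spectrogram.
    return [f"mel({b},{target_speaker})" for b in bnfs]


def vocoder(mels):
    # Waveform reconstruction from acoustic features.
    return [f"wave({m})" for m in mels]


def convert(wav_frames, target_speaker="bzn"):
    # Full any-to-one pipeline: wav -> BNFs -> target mels -> wav.
    return vocoder(conversion_model(asr_encoder(wav_frames), target_speaker))
```

Because the BNFs carry content rather than voice, any source speaker can be converted to the speakers seen in training.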

When you encounter problems while working on the project, search the existing issues first to see if someone has hit the same problem. If not, create a new issue and state your problem clearly.
