XXxin1/VQMIVC

VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion (Interspeech 2021)

Paper | Pre-trained models | Demo

This paper proposes a speech representation disentanglement framework for one-shot/any-to-any voice conversion, which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference. Vector quantization with contrastive predictive coding (VQCPC) is used for content encoding and mutual information (MI) is introduced as the correlation metric during training, to achieve proper disentanglement of content, speaker and pitch representations, by reducing their inter-dependencies in an unsupervised manner.
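To make the vector quantization step concrete, here is a minimal sketch of nearest-neighbour codebook lookup, the basic operation behind VQ-based content encoding. It is not the repository's implementation: the function names, the toy codebook and the toy frames are all made up for illustration, and the CPC loss and codebook training are left out entirely.

```python
# Illustrative sketch only: nearest-neighbour vector quantization.
# The codebook and frames below are toy values, not VQMIVC parameters.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames, codebook):
    """Map each frame to the index and value of its closest codebook entry."""
    indices = []
    quantized = []
    for frame in frames:
        # Pick the codebook entry with the smallest distance to this frame.
        best = min(range(len(codebook)),
                   key=lambda k: squared_distance(frame, codebook[k]))
        indices.append(best)
        quantized.append(codebook[best])
    return indices, quantized


# Hypothetical 2-dimensional codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
# Hypothetical encoder outputs, one vector per frame.
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

indices, quantized = quantize(frames, codebook)
print(indices)
```

Replacing each frame with a discrete codebook entry limits how much information the content representation can carry. This is one reason quantization is often used to discourage speaker detail from leaking into the content codes.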

Requirements

Python 3.6 is used. Installing apex to speed up training is optional. The other requirements are listed in 'requirements.txt':

pip install -r requirements.txt

Quick start with pre-trained models

ParallelWaveGAN is used as the vocoder, so install ParallelWaveGAN first, then try the pre-trained models:

python convert_example.py -s {source-wav} -r {reference-wav} -c {converted-wavs-save-path} -m {model-path}

For example:

python convert_example.py -s test_wavs/p225_038.wav -r test_wavs/p334_047.wav -c converted -m checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/VQMIVC-model.ckpt-500.pt

The converted wav is saved in the 'converted' directory.
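When converting several utterances, it can help to build the command programmatically. The sketch below only assembles the argument list for 'convert_example.py' using the four flags shown above (-s, -r, -c, -m). The helper name is hypothetical, and nothing is executed here; pass the list to subprocess.run yourself if you want to launch it.

```python
import shlex


def build_convert_command(source_wav, reference_wav, output_dir, model_path):
    """Return the argument list for one convert_example.py call.

    Hypothetical helper: it only mirrors the flags documented above.
    """
    return [
        "python", "convert_example.py",
        "-s", source_wav,      # utterance whose content is kept
        "-r", reference_wav,   # single utterance from the target speaker
        "-c", output_dir,      # directory for the converted wavs
        "-m", model_path,      # pre-trained model checkpoint
    ]


command = build_convert_command(
    "test_wavs/p225_038.wav",
    "test_wavs/p334_047.wav",
    "converted",
    "checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/VQMIVC-model.ckpt-500.pt",
)

# Shell-safe string form, useful for logging what will be run.
command_line = " ".join(shlex.quote(part) for part in command)
print(command_line)
```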

Training and inference:

Step1. Data preparation & preprocessing

Put the VCTK corpus under the directory 'Dataset/'.
Split the training/testing speakers and extract features (mel + lf0):
```
 python preprocess.py
```
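As a rough illustration of the pitch feature, the sketch below turns an F0 contour into a per-utterance normalized log-F0 sequence. It assumes that 'lf0' means log-F0, that unvoiced frames are marked with 0, and that voiced frames are z-normalized within each utterance. These are assumptions for illustration, so check 'preprocess.py' for what the repository actually does. The function name and the toy contour are hypothetical.

```python
import math


def normalized_log_f0(f0, eps=1e-8):
    """Log-F0 with zero mean and unit variance over voiced frames.

    Assumption: unvoiced frames have f0 == 0 and stay 0 in the output.
    """
    voiced = [math.log(value) for value in f0 if value > 0]
    if not voiced:
        # Fully unvoiced utterance: nothing to normalize.
        return [0.0] * len(f0)
    mean = sum(voiced) / len(voiced)
    std = math.sqrt(sum((v - mean) ** 2 for v in voiced) / len(voiced))
    return [
        (math.log(value) - mean) / (std + eps) if value > 0 else 0.0
        for value in f0
    ]


# Toy contour in Hz: unvoiced, 100 Hz, 200 Hz, unvoiced.
lf0 = normalized_log_f0([0.0, 100.0, 200.0, 0.0])
print(lf0)
```

Normalizing per utterance removes the speaker's average pitch level, so the remaining contour mainly reflects intonation.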

Step2. Model training

Training with mutual information minimization (MIM):

 python train.py use_CSMI=True use_CPMI=True use_PSMI=True

Training without MIM:

 python train.py use_CSMI=False use_CPMI=False use_PSMI=False
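The acknowledgements note that mutual information estimation is modified from CLUB. To show what is being minimized when MIM is enabled, here is a minimal, sampled sketch of a CLUB-style upper bound for scalar variables. It assumes a unit-variance Gaussian q(y|x) whose mean comes from a caller-supplied function. The function names and toy data are hypothetical; the repository works on learned representations with trained networks, not on this toy setup.

```python
# Illustrative sketch of a CLUB-style estimate:
#   I_CLUB = E_{p(x,y)}[log q(y|x)] - E_{p(x)p(y)}[log q(y|x)]
# Assumption: q(y|x) is a unit-variance Gaussian with mean predict_mean(x).

def club_upper_bound(xs, ys, predict_mean):
    """Estimate the bound from paired samples (xs[i], ys[i])."""
    n = len(xs)

    def log_q(x, y):
        # Gaussian log-density up to a constant that cancels in the difference.
        return -0.5 * (y - predict_mean(x)) ** 2

    # Positive term: matched pairs drawn from the joint distribution.
    positive = sum(log_q(x, y) for x, y in zip(xs, ys)) / n
    # Negative term: all pairings, approximating the product of marginals.
    negative = sum(log_q(x, y) for x in xs for y in ys) / (n * n)
    return positive - negative


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 2.0, 3.0]  # y is fully determined by x in this toy example

dependent = club_upper_bound(xs, ys, lambda x: x)      # predictor uses x
uninformed = club_upper_bound(xs, ys, lambda x: 0.0)   # predictor ignores x
print(dependent, uninformed)
```

The estimate is large when knowing x makes y much more predictable, and it drops to zero when the predictor gains nothing from x. Minimizing such a term between two representations pushes them towards independence.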

Step3. Model testing

Put the PWG vocoder under the directory 'vocoder/'.

Inference with the model trained with MIM:

 python convert.py checkpoint=checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/model.ckpt-500.pt

Inference with the model trained without MIM:

 python convert.py checkpoint=checkpoints/useCSMIFalse_useCPMIFalse_usePSMIFalse_useAmpTrue/model.ckpt-500.pt
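The two checkpoint paths above suggest that the directory name encodes the training flags. The helper below reproduces that naming pattern so the path matches the flags used for training. The pattern is inferred from these two examples only, the helper name is hypothetical, and the 'useAmp' part is assumed to default to True because both examples show it that way.

```python
def checkpoint_dir(use_csmi, use_cpmi, use_psmi, use_amp=True, root="checkpoints"):
    """Build a checkpoint directory name from the training flags.

    Naming pattern inferred from the example paths, not from the code.
    """
    name = "useCSMI{}_useCPMI{}_usePSMI{}_useAmp{}".format(
        use_csmi, use_cpmi, use_psmi, use_amp
    )
    return root + "/" + name


with_mim = checkpoint_dir(True, True, True)
without_mim = checkpoint_dir(False, False, False)
print(with_mim)
print(without_mim)
```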

Citation

If you use this code in your research, please star our repo and cite our paper:

@article{wang2021vqmivc,
  title={VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion},
  author={Wang, Disong and Deng, Liqun and Yeung, Yu Ting and Chen, Xiao and Liu, Xunying and Meng, Helen},
  journal={arXiv preprint arXiv:2106.10132},
  year={2021}
}

Acknowledgements:

The content encoder is borrowed from VectorQuantizedCPC, which also inspired the within-utterance negative sampling for CPC;
The speaker encoder is borrowed from AdaIN-VC;
The decoder is modified from AutoVC;
The mutual information estimation is modified from CLUB;
Speech feature extraction is based on espnet and Pyworld.
