DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

中文文档 (Chinese documentation)

This repository is the official PyTorch implementation of our AAAI-2022 paper, in which we propose DiffSinger (for Singing-Voice-Synthesis) and DiffSpeech (for Text-to-Speech).

[Figures: DiffSinger/DiffSpeech at training; DiffSinger/DiffSpeech at inference]

🎉 🎉 🎉 Updates:

Mar.2, 2022: MIDI-new-version: A substantial improvement ✨
Mar.1, 2022: NeuralSVB, for singing voice beautifying, has been released ✨ ✨ ✨.
Feb.13, 2022: NATSpeech, an improved code framework containing the implementations of DiffSpeech and our NeurIPS-2021 work PortaSpeech, has been released ✨ ✨ ✨.
Jan.29, 2022: Added support for MIDI-old-version SVS. Still being updated. 🚧 ⛏️ 🛠️
Jan.13, 2022: Added support for SVS; released the PopCS dataset.
Dec.19, 2021: Added support for TTS. HuggingFace🤗 Demo

🚀 News:

Feb.24, 2022: Our new work, NeuralSVB, was accepted by ACL-2022. Demo Page.
Dec.01, 2021: DiffSinger was accepted by AAAI-2022.
Sep.29, 2021: Our recent work, PortaSpeech: Portable and High-Quality Generative Text-to-Speech, was accepted by NeurIPS-2021.
May.06, 2021: We submitted DiffSinger to arXiv.

Environments

conda create -n your_env_name python=3.8
source activate your_env_name
pip install -r requirements_2080.txt   # GPU 2080Ti, CUDA 10.2
# or:
pip install -r requirements_3090.txt   # GPU 3090, CUDA 11.4
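
The choice between the two requirements files can be scripted. The following is a minimal sketch, not part of this repository: `pick_requirements` is a hypothetical helper that maps a GPU name string to one of the two files listed above, assuming the name contains "2080" or "3090".

```python
from typing import Optional

# Mapping taken from the two install commands above.
REQUIREMENTS_BY_GPU = {
    "2080": "requirements_2080.txt",  # GPU 2080Ti, CUDA 10.2
    "3090": "requirements_3090.txt",  # GPU 3090, CUDA 11.4
}


def pick_requirements(gpu_name: str) -> Optional[str]:
    """Return the requirements file matching gpu_name, or None if unknown.

    Hypothetical helper: it matches on a substring of the GPU name only.
    """
    for key, filename in REQUIREMENTS_BY_GPU.items():
        if key in gpu_name:
            return filename
    return None
```

For any other GPU the helper returns None, in which case the dependency versions have to be chosen by hand.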

DiffSpeech (TTS version)

1. Preparation

Data Preparation

a) Download and extract the LJ Speech dataset, then create a link to the dataset folder: ln -s /xxx/LJSpeech-1.1/ data/raw/

b) Download and extract the ground-truth durations produced by MFA (Montreal Forced Aligner): tar -xvf mfa_outputs.tar; mv mfa_outputs data/processed/ljspeech/

c) Run the following scripts to pack the dataset for training/inference.

export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config configs/tts/lj/fs2.yaml

# `data/binary/ljspeech` will be generated.
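
Before training, it can help to confirm that steps a) to c) left the expected directories in place. The sketch below is an illustration, not part of the repository: `missing_paths` is a hypothetical helper, and the three paths are our reading of the commands above (in particular, that the symlink ends up at data/raw/LJSpeech-1.1).

```python
from pathlib import Path
from typing import List

# Paths implied by steps a) to c); treat them as assumptions about the layout.
EXPECTED_PATHS = [
    "data/raw/LJSpeech-1.1",                # symlink created in step a)
    "data/processed/ljspeech/mfa_outputs",  # MFA durations moved in step b)
    "data/binary/ljspeech",                 # output of binarize.py in step c)
]


def missing_paths(repo_root: str) -> List[str]:
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]
```

Calling `missing_paths(".")` from the repository root returns an empty list when all three locations exist.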

Vocoder Preparation

We provide a pre-trained HifiGAN vocoder model. Please unzip this file into checkpoints before training your acoustic model.

2. Training Example

First, you need a pre-trained FastSpeech2 checkpoint. You can use the pre-trained model, or train FastSpeech2 from scratch by running:

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config configs/tts/lj/fs2.yaml --exp_name fs2_lj_1 --reset

Then, to train DiffSpeech, run:

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name lj_ds_beta6_1213 --reset

Remember to set the "fs2_ckpt" parameter in usr/configs/lj_ds_beta6.yaml to the path of your FastSpeech2 checkpoint.
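
The "fs2_ckpt" value can be edited by hand or with a small script. The sketch below is one possible approach under stated assumptions: `set_fs2_ckpt` is a hypothetical helper, and it assumes the key appears as a top-level `fs2_ckpt: ...` entry on a single line. It works on the file text directly so that comments and the order of other keys are preserved.

```python
import re


def set_fs2_ckpt(config_text: str, new_path: str) -> str:
    """Replace the value of a top-level `fs2_ckpt:` line in YAML text.

    Line-based sketch: it assumes the key sits on one line of its own and
    leaves every other line untouched. Raises ValueError if the key is absent.
    """
    pattern = re.compile(r"^fs2_ckpt:.*$", flags=re.MULTILINE)
    if not pattern.search(config_text):
        raise ValueError("fs2_ckpt not found in config text")
    # A function replacement avoids backslash escapes being interpreted in new_path.
    return pattern.sub(lambda _m: f"fs2_ckpt: {new_path}", config_text, count=1)
```

To apply it, read usr/configs/lj_ds_beta6.yaml as text, pass the text and your checkpoint path to `set_fs2_ckpt`, and write the result back.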

3. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name lj_ds_beta6_1213 --reset --infer

We also provide:

the pre-trained model of DiffSpeech;
the individual pre-trained model of FastSpeech 2 for the shallow diffusion mechanism in DiffSpeech;

Remember to put the pre-trained models in the checkpoints directory.
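
For readers unfamiliar with the shallow diffusion mechanism mentioned above, the sketch below illustrates the idea: rather than starting the reverse process from pure Gaussian noise, the mel-spectrogram predicted by the auxiliary model (FastSpeech 2 here) is diffused to an intermediate step k, and denoising starts from there. This is an illustrative pure-Python sketch, not the code in this repository; the function names, the linear schedule, and the numeric values (100 steps, k = 50, the beta range) are assumptions chosen for illustration.

```python
import math
import random
from typing import List


def linear_betas(num_steps: int, beta_min: float, beta_max: float) -> List[float]:
    """Linearly spaced noise schedule (assumed; values are illustrative)."""
    step = (beta_max - beta_min) / (num_steps - 1)
    return [beta_min + i * step for i in range(num_steps)]


def alpha_bar(betas: List[float]) -> List[float]:
    """Cumulative products of (1 - beta_t), one value per diffusion step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def diffuse_to_step(mel: List[float], k: int, a_bar: List[float],
                    rng: random.Random) -> List[float]:
    """Sample x_k ~ N(sqrt(a_bar_k) * mel, (1 - a_bar_k) * I); k is 1-indexed."""
    signal = math.sqrt(a_bar[k - 1])
    noise = math.sqrt(1.0 - a_bar[k - 1])
    return [signal * m + noise * rng.gauss(0.0, 1.0) for m in mel]


# Illustrative usage with a tiny stand-in for the auxiliary model's output.
a_bar = alpha_bar(linear_betas(100, 1e-4, 0.06))
coarse_mel = [0.5, -0.2, 0.1]
x_k = diffuse_to_step(coarse_mel, k=50, a_bar=a_bar, rng=random.Random(0))
```

In a full system, a trained denoiser would then run the reverse steps from k down to 0 starting at `x_k`; that part is omitted here.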

DiffSinger (SVS version)

0. Data Acquisition

See apply_form.
Dataset preview.

1. Preparation

Data Preparation

a) Download and extract PopCS, then create a link to the dataset folder: ln -s /xxx/popcs/ data/processed/popcs

b) Run the following scripts to pack the dataset for training/inference.

export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config usr/configs/popcs_ds_beta6.yaml
# `data/binary/popcs-pmf0` will be generated.

Vocoder Preparation

We provide a pre-trained HifiGAN-Singing model, which is specially designed for SVS and uses the NSF mechanism. Please unzip this file into checkpoints before training your acoustic model.

(Update: you can also move a checkpoint with more training steps into this vocoder directory.)

This singing vocoder is trained on ~70 hours of singing data and can be viewed as a universal vocoder.

2. Training Example

First, you need a pre-trained FFT-Singer checkpoint. You can use the pre-trained model, or train FFT-Singer from scratch by running:

# First, train fft-singer;
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset
# Then, infer fft-singer;
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset --infer

Then, to train DiffSinger, run:

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name popcs_ds_beta6_offline_pmf0_1230 --reset

Remember to set the "fs2_ckpt" parameter in usr/configs/popcs_ds_beta6_offline.yaml to the path of your FFT-Singer checkpoint.

3. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name popcs_ds_beta6_offline_pmf0_1230 --reset --infer
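
The SVS commands above follow a fixed order: train FFT-Singer, run FFT-Singer inference, train DiffSinger, then run DiffSinger inference. The sketch below collects them in one place; `run_command` and `PIPELINE` are hypothetical names introduced for illustration, and only the flags already shown above are used.

```python
from typing import List


def run_command(config: str, exp_name: str, infer: bool = False) -> List[str]:
    """Build the argv list for tasks/run.py as used in the commands above."""
    cmd = ["python", "tasks/run.py", "--config", config,
           "--exp_name", exp_name, "--reset"]
    if infer:
        cmd.append("--infer")
    return cmd


FS2_CONFIG = "usr/configs/popcs_fs2.yaml"
DS_CONFIG = "usr/configs/popcs_ds_beta6_offline.yaml"

# Order follows the sections above: FFT-Singer first, then DiffSinger.
PIPELINE = [
    run_command(FS2_CONFIG, "popcs_fs2_pmf0_1230"),
    run_command(FS2_CONFIG, "popcs_fs2_pmf0_1230", infer=True),
    run_command(DS_CONFIG, "popcs_ds_beta6_offline_pmf0_1230"),
    run_command(DS_CONFIG, "popcs_ds_beta6_offline_pmf0_1230", infer=True),
]
```

Each entry can be passed to `subprocess.run(cmd, check=True)` from the repository root, with CUDA_VISIBLE_DEVICES and PYTHONPATH set in the environment as in the shell commands above.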

We also provide:

the pre-trained model of DiffSinger;
the pre-trained model of FFT-Singer for the shallow diffusion mechanism in DiffSinger;

Remember to put the pre-trained models in the checkpoints directory.

Note that:

The original PWG vocoder used in the paper has been put into commercial use, so we provide this HifiGAN vocoder as a substitute.
We assume the ground-truth F0 is given as the pitch information, following [1][2][3]. If you want to conduct experiments on MIDI data, you need either an external F0 predictor (as in MIDI-old-version) or joint prediction of F0 with the spectrograms (as in MIDI-new-version).

[1] Adversarially trained multi-singer sequence-to-sequence singing synthesizer. Interspeech 2020.

[2] Sequence-to-Sequence Singing Synthesis Using the Feed-Forward Transformer. ICASSP 2020.

[3] DeepSinger: Singing Voice Synthesis with Data Mined From the Web. KDD 2020.

Tensorboard

tensorboard --logdir_spec exp_name

Mel Visualization

Along the vertical axis, DiffSpeech: [0-80]; FastSpeech2: [80-160].

DiffSpeech vs. FastSpeech 2
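
If you export the plotted array, the two models' spectrograms can be separated along the vertical axis using the ranges given above. This is a minimal sketch under stated assumptions: `split_stacked_mel` is a hypothetical helper, the array is assumed to be a list of 160 rows (frequency bins) with row index equal to the vertical-axis value, and each model is assumed to occupy 80 bins.

```python
from typing import List, Tuple

Mel = List[List[float]]  # rows are frequency bins, columns are frames


def split_stacked_mel(stacked: Mel, bins_per_model: int = 80) -> Tuple[Mel, Mel]:
    """Split a vertically stacked array into (DiffSpeech, FastSpeech2) parts.

    Rows [0, 80) are taken as DiffSpeech and rows [80, 160) as FastSpeech2,
    following the axis ranges stated above.
    """
    if len(stacked) != 2 * bins_per_model:
        raise ValueError(f"expected {2 * bins_per_model} rows, got {len(stacked)}")
    return stacked[:bins_per_model], stacked[bins_per_model:]
```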

Audio Demos

Audio samples can be found in our demo page.

Some test-set audio samples generated by DiffSpeech+HifiGAN (marked [P]) and by GTmel+HifiGAN (marked [G]) are also available in resources/demos_1213.

(corresponding to the pre-trained model DiffSpeech)

🚀 🚀 🚀 Update:

New singing samples can be found in resources/demos_0112.

Citation

@article{liu2021diffsinger,
  title={Diffsinger: Singing voice synthesis via shallow diffusion mechanism},
  author={Liu, Jinglin and Li, Chengxi and Ren, Yi and Chen, Feiyang and Liu, Peng and Zhao, Zhou},
  journal={arXiv preprint arXiv:2105.02446},
  volume={2},
  year={2021}}

Acknowledgements

Our code is based on the following repos:

Thanks also to Keon Lee for a fast implementation of our work.
