johndpope / DiffSinger

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022; Official code

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism

arXiv GitHub Stars downloads | Hugging Face | 中文文档

This repository is the official PyTorch implementation of our AAAI-2022 paper, in which we propose DiffSinger (for Singing-Voice-Synthesis) and DiffSpeech (for Text-to-Speech).

DiffSinger/DiffSpeech at training DiffSinger/DiffSpeech at inference
Training Inference

🎉 🎉 🎉 Updates:

  • Mar.2, 2022: MIDI-new-version: A substantial improvement ✨
  • Mar.1, 2022: NeuralSVB, for singing voice beautifying, has been released ✨ ✨ ✨ .
  • Feb.13, 2022: NATSpeech, the improved code framework, which contains the implementations of DiffSpeech and our NeurIPS-2021 work PortaSpeech has been released ✨ ✨ ✨.
  • Jan.29, 2022: support MIDI-old-version SVS. Keep Updating. 🚧 ⛏️ 🛠️
  • Jan.13, 2022: support SVS, release PopCS dataset.
  • Dec.19, 2021: support TTS. HuggingFace🤗 Demo

🚀 News:

  • Feb.24, 2022: Our new work, NeuralSVB was accepted by ACL-2022 arXiv. Demo Page.
  • Dec.01, 2021: DiffSinger was accepted by AAAI-2022.
  • Sep.29, 2021: Our recent work PortaSpeech: Portable and High-Quality Generative Text-to-Speech was accepted by NeurIPS-2021 arXiv .
  • May.06, 2021: We submitted DiffSinger to Arxiv arXiv.

Environments

conda create -n your_env_name python=3.8
source activate your_env_name 
pip install -r requirements_2080.txt   (GPU 2080Ti, CUDA 10.2)
or pip install -r requirements_3090.txt   (GPU 3090, CUDA 11.4)

DiffSpeech (TTS version)

1. Preparation

Data Preparation

a) Download and extract the LJ Speech dataset, then create a link to the dataset folder: ln -s /xxx/LJSpeech-1.1/ data/raw/

b) Download and Unzip the ground-truth duration extracted by MFA: tar -xvf mfa_outputs.tar; mv mfa_outputs data/processed/ljspeech/

c) Run the following scripts to pack the dataset for training/inference.

export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config configs/tts/lj/fs2.yaml

# `data/binary/ljspeech` will be generated.

Vocoder Preparation

We provide the pre-trained model of HifiGAN vocoder. Please unzip this file into checkpoints before training your acoustic model.

2. Training Example

First, you need a pre-trained FastSpeech2 checkpoint. You can use the pre-trained model, or train FastSpeech2 from scratch, run:

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config configs/tts/lj/fs2.yaml --exp_name fs2_lj_1 --reset

Then, to train DiffSpeech, run:

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name lj_ds_beta6_1213 --reset

Remember to adjust the "fs2_ckpt" parameter in usr/configs/lj_ds_beta6.yaml to fit your path.

3. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name lj_ds_beta6_1213 --reset --infer

We also provide:

  • the pre-trained model of DiffSpeech;
  • the individual pre-trained model of FastSpeech 2 for the shallow diffusion mechanism in DiffSpeech;

Remember to put the pre-trained models in checkpoints directory.

DiffSinger (SVS version)

0. Data Acquirement

1. Preparation

Data Preparation

a) Download and extract PopCS, then create a link to the dataset folder: ln -s /xxx/popcs/ data/processed/popcs

b) Run the following scripts to pack the dataset for training/inference.

export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config usr/configs/popcs_ds_beta6.yaml
# `data/binary/popcs-pmf0` will be generated.

Vocoder Preparation

We provide the pre-trained model of HifiGAN-Singing which is specially designed for SVS with NSF mechanism. Please unzip this file into checkpoints before training your acoustic model.

(Update: You can also move a ckpt with more training steps into this vocoder directory)

This singing vocoder is trained on ~70 hours singing data, which can be viewed as a universal vocoder.

2. Training Example

First, you need a pre-trained FFT-Singer checkpoint. You can use the pre-trained model, or train FFT-Singer from scratch, run:

# First, train fft-singer;
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset
# Then, infer fft-singer;
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_fs2.yaml --exp_name popcs_fs2_pmf0_1230 --reset --infer 

Then, to train DiffSinger, run:

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name popcs_ds_beta6_offline_pmf0_1230 --reset

Remember to adjust the "fs2_ckpt" parameter in usr/configs/popcs_ds_beta6_offline.yaml to fit your path.

3. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/popcs_ds_beta6_offline.yaml --exp_name popcs_ds_beta6_offline_pmf0_1230 --reset --infer

We also provide:

  • the pre-trained model of DiffSinger;
  • the pre-trained model of FFT-Singer for the shallow diffusion mechanism in DiffSinger;

Remember to put the pre-trained models in checkpoints directory.

Note that:

  • the original PWG version vocoder in the paper we used has been put into commercial use, so we provide this HifiGAN version vocoder as a substitute.
  • we assume the ground-truth F0 to be given as the pitch information following [1][2][3]. If you want to conduct experiments on MIDI data, you need an external F0 predictor (like MIDI-old-version) or a joint prediction with spectrograms(like MIDI-new-version).

[1] Adversarially trained multi-singer sequence-to-sequence singing synthesizer. Interspeech 2020.

[2] SEQUENCE-TO-SEQUENCE SINGING SYNTHESIS USING THE FEED-FORWARD TRANSFORMER. ICASSP 2020.

[3] DeepSinger : Singing Voice Synthesis with Data Mined From the Web. KDD 2020.

Tensorboard

tensorboard --logdir_spec exp_name
Tensorboard

Mel Visualization

Along vertical axis, DiffSpeech: [0-80]; FastSpeech2: [80-160].

DiffSpeech vs. FastSpeech 2
DiffSpeech-vs-FastSpeech2
DiffSpeech-vs-FastSpeech2
DiffSpeech-vs-FastSpeech2

Audio Demos

Audio samples can be found in our demo page.

We also put part of the audio samples generated by DiffSpeech+HifiGAN (marked as [P]) and GTmel+HifiGAN (marked as [G]) of test set in resources/demos_1213.

(corresponding to the pre-trained model DiffSpeech)


🚀 🚀 🚀 Update:

New singing samples can be found in resources/demos_0112.

Citation

@article{liu2021diffsinger,
  title={Diffsinger: Singing voice synthesis via shallow diffusion mechanism},
  author={Liu, Jinglin and Li, Chengxi and Ren, Yi and Chen, Feiyang and Liu, Peng and Zhao, Zhou},
  journal={arXiv preprint arXiv:2105.02446},
  volume={2},
  year={2021}}

Acknowledgements

Our codes are based on the following repos:

Also thanks Keon Lee for fast implementation of our work.

About

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022; Official code

License:MIT License


Languages

Language:Python 100.0%