StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis

Yu Zhang, Rongjie Huang, Ruiqi Li, JinZheng He, Yan Xia, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao | Zhejiang University, Huawei Cloud

PyTorch Implementation of StyleSinger (AAAI 2024): Style Transfer for Out-of-Domain Singing Voice Synthesis.

arXiv

We provide our implementation and pre-trained models in this repository.

Visit our demo page for audio samples.

Pre-trained Models

You can use the pre-trained models we provide here. Details of each folder are as follows:

Model       | Description
StyleSinger | Acoustic model (config)
HIFI-GAN    | Neural Vocoder
Encoder     | Emotion Encoder

Dependencies

A suitable conda environment named stylesinger can be created and activated with:

conda create -n stylesinger python=3.8
conda activate stylesinger
conda install --yes --file requirements.txt

Multi-GPU

By default, this implementation uses as many GPUs in parallel as returned by torch.cuda.device_count(). You can specify which GPUs to use by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.
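For example, to restrict training to GPUs 0 and 1, you could prefix the training command from the section below like this:

CUDA_VISIBLE_DEVICES=0,1 python tasks/run.py --config egs/stylesinger.yaml --exp_name StyleSinger --reset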

Inference for style transfer with custom timbre and style

Here we provide a singing voice synthesis pipeline using StyleSinger.

  1. Prepare StyleSinger (acoustic model): download the checkpoint and place it in checkpoints/StyleSinger
  2. Prepare HIFI-GAN (neural vocoder): download the checkpoint and place it in checkpoints/hifigan
  3. Prepare the Emotion Encoder: download the checkpoint and place it at checkpoints/global.pt
  4. Prepare the dataset: download the statistical files and place them in data/binary/test_set
  5. Prepare the reference information: provide a reference audio (48 kHz) together with the target ph sequence, the target note for each ph, the target note_dur for each ph, the target note_type for each ph (rest: 1, lyric: 2, slur: 3), and the reference audio path. Enter this information in inference/StyleSinger.py (see the illustrative sketch below), then run:
CUDA_VISIBLE_DEVICES=$GPU python inference/StyleSinger.py --config egs/stylesinger.yaml --exp_name checkpoints/StyleSinger

Generated wav files are saved in infer_out by default.
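As a rough illustration of step 5, the reference information passed to inference/StyleSinger.py could be organized like the dictionary below. The key names and value formats here are hypothetical placeholders, not the script's actual interface; check inference/StyleSinger.py for the exact fields it expects.

# Hypothetical sketch only: key names and value formats are placeholders;
# consult inference/StyleSinger.py for the exact interface.
inp = {
    'ref_audio': 'data/ref/singer01_48k.wav',  # path to the 48 kHz reference audio
    'ph': 'AP w o AP',                         # target ph sequence
    'note': 'rest | C4 | C4 | rest',           # target note for each ph
    'note_dur': '0.3 | 0.1 | 0.5 | 0.3',       # target note_dur for each ph (seconds)
    'note_type': '1 | 2 | 2 | 1',              # target note_type for each ph (rest: 1, lyric: 2, slur: 3)
}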

Train your own model

Data Preparation

  1. Prepare your own singing dataset or download M4Singer. (Note: you have to segment M4Singer and align the note pitch, note duration, and note type (rest: 1, lyric: 2, slur: 3) for each ph, stored as ep_pitches, ep_notedurs, and ep_types.)
  2. Put metadata.json (including ph, word, item_name, ph_durs, wav_fn, singer, ep_pitches, ep_notedurs, and ep_types for each singing voice; see the illustrative sketch below), spker_set.json (including all singers and their ids), and phone_set.json (all phonemes of your dictionary) in data/processed/style
  3. Set processed_data_dir, binary_data_dir, valid_prefixes, and test_prefixes in the config.
  4. Download the global emotion encoder and set its path as emotion_encoder_path.
  5. Preprocess the dataset:
export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config egs/stylesinger.yaml
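As a rough illustration of step 2, one entry of metadata.json could look like the sketch below. The values are hypothetical placeholders, and details such as the pitch encoding and duration units should be checked against data_gen/tts/bin/binarize.py and the config before use.

# Hypothetical sketch only: values are placeholders; verify the expected formats
# (pitch encoding, duration units, file layout) against the binarizer and config.
import json

item = {
    "item_name": "singer01#song01#0000",
    "ph": ["<AP>", "w", "o", "<AP>"],             # phoneme sequence
    "word": ["<AP>", "我", "<AP>"],                # word sequence
    "ph_durs": [0.30, 0.10, 0.50, 0.30],          # duration of each ph in seconds
    "wav_fn": "data/processed/style/wavs/singer01#song01#0000.wav",
    "singer": "singer01",
    "ep_pitches": [0, 60, 60, 0],                 # note pitch for each ph (e.g., MIDI number, 0 for rest)
    "ep_notedurs": [0.30, 0.60, 0.60, 0.30],      # note duration for each ph
    "ep_types": [1, 2, 2, 1],                     # note type for each ph (rest: 1, lyric: 2, slur: 3)
}

# metadata.json holds one such entry per singing-voice segment.
with open("data/processed/style/metadata.json", "w", encoding="utf-8") as f:
    json.dump([item], f, ensure_ascii=False, indent=2)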

Training StyleSinger

CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/stylesinger.yaml  --exp_name StyleSinger --reset

Inference using StyleSinger

CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/stylesinger.yaml  --exp_name StyleSinger --infer

Quick Inference

We provide a mini test set here to demonstrate StyleSinger. Specifically, we provide the statistical (binarized) files, which allow faster IO, while the WAV files are for listening. Please download the statistical files and place them at data/binary/style/.

Run

CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/stylesinger.yaml  --exp_name StyleSinger --infer

You will find the outputs in checkpoints/StyleSinger/generated_320000_/wavs, where [Ref] indicates ground-truth mel results and [SVS] indicates the predicted results.

Acknowledgements

This implementation uses parts of the code from the following GitHub repos: GenerSpeech, NATSpeech, ProDiff, and DiffSinger, as described in our code.

Citations

If you find this code useful in your research, please cite our work:

@inproceedings{zhang2024stylesinger,
  title={StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis},
  author={Zhang, Yu and Huang, Rongjie and Li, Ruiqi and He, JinZheng and Xia, Yan and Chen, Feiyang and Duan, Xinyu and Huai, Baoxing and Zhao, Zhou},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={17},
  pages={19597--19605},
  year={2024}
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's singing voice without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this requirement, you may be in violation of copyright laws.
