seastar105 / so-vits-svc-5.0

Core Engine of Singing Voice Conversion & Singing Voice Clone

Home Page: https://huggingface.co/spaces/maxmax20160403/sovits5.0


Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS


  • 💗 The target audience of this project is deep-learning beginners; basic Python and PyTorch skills are a prerequisite for using it.
  • 💗 The project aims to help deep-learning beginners move past dry, purely theoretical study and master the fundamentals of deep learning through hands-on practice.
  • 💗 The project will not develop one-click packages for other purposes.

[Figure: sovits_framework]

【No leakage】A multi-speaker SVC library without timbre leakage

【With accompaniment】An SVC library that can convert singing voices even with (light) accompaniment

【With Excel】Hand-tune the raw F0 for SVC yourself

[Figure: Sonic Visualiser]

This project is not based on svc-develop-team/so-vits-svc; quite the opposite, see https://github.com/svc-develop-team/so-vits-svc/tree/2.0

This project is still being debugged and developed. Issues are closed while development proceeds in private and will be reopened on completion; if the project cannot be finished before July 1, it will be deleted.

  • The preview model (generator + discriminator) totals 194M; with batch_size set to 8, training uses 7.5G of VRAM, which greatly lowers the barrier to entry
  • The preview model contains 56 speakers; the speaker files are in the configs/singers directory and can be used for inference tests, especially for testing timbre leakage
  • Speakers 22, 30, 47, and 51 are the most distinctive; speaker samples are in the configs/singers_sample directory
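
As a quick sanity check on these speaker files, the sketch below loads one of the shipped embeddings with NumPy (the path follows the inference example later in this README; the printed shape depends on the speaker encoder, so treat this purely as an inspection aid):

    import numpy as np

    # Load one of the 56 shipped speaker embeddings
    # (the same kind of file that svc_inference.py takes via --spk).
    spk = np.load("configs/singers/singer0001.npy")

    # Inspect it; the exact shape depends on the speaker encoder.
    print(spk.shape, spk.dtype)
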
| Feature | From | Function | Remarks |
| ------- | ---- | -------- | ------- |
| whisper | OpenAI | Strong noise robustness | Required |
| bigvgan | NVIDIA | Anti-aliasing and snake activation | Removed, too much GPU usage |
| natural speech | Microsoft | Reduces pronunciation errors | Stage-two training |
| neural source-filter | NII | Fixes broken (dropped) audio | Required |
| speaker encoder | Google | Timbre encoding and clustering | Required |
| GRL for speaker | Ubisoft | Prevents the encoder from leaking timbre | Stage-two training |
| one shot vits | Samsung | One-shot VITS cloning from a single sentence | Required |
| SCLN | Microsoft | Improves cloning | Required |
| band extension | Adobe | Upsamples 16 kHz to 48 kHz | Data processing |

Dataset preparation

💗 Necessary preprocessing:

Then place the dataset into the dataset_raw directory with the following file structure

dataset_raw
├───speaker0
│   ├───xxx1-xxx1.wav
│   ├───...
│   └───Lxx-0xx8.wav
└───speaker1
    ├───xx2-0xxx2.wav
    ├───...
    └───xxx7-xxx007.wav
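
To sanity-check the layout before preprocessing, here is a minimal sketch; it only assumes one sub-directory per speaker containing .wav files, as in the tree above:

    from pathlib import Path

    # Walk dataset_raw and report each speaker directory and its wav count.
    root = Path("dataset_raw")
    for speaker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        wavs = sorted(speaker_dir.glob("*.wav"))
        assert wavs, f"no .wav files found for speaker {speaker_dir.name}"
        print(f"{speaker_dir.name}: {len(wavs)} wav files")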

Install dependencies
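
This README does not list the install command; assuming the repository ships a standard requirements.txt, the usual route would be:

    pip install -r requirements.txt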

Data preprocessing (a script that chains all the steps below follows the list)

  • 1. Set the working directory 💗💗💗 (later steps will fail if this is not set):

    export PYTHONPATH=$PWD

  • 2. Resample

    Clip the audio into segments shorter than 30 seconds, as required by whisper.

    Generate 16000 Hz audio, stored under ./data_svc/waves-16k:

    python prepare/preprocess_a.py -w ./data_raw -o ./data_svc/waves-16k -s 16000

    Generate 32000 Hz audio, stored under ./data_svc/waves-32k:

    python prepare/preprocess_a.py -w ./data_raw -o ./data_svc/waves-32k -s 32000

    Optionally upsample 16000 Hz to 32000 Hz (batch processing still to be finished):

    python bandex/inference.py -w svc_out.wav

  • 3. Extract pitch from the 16 kHz audio:

    python prepare/preprocess_f0.py -w data_svc/waves-16k/ -p data_svc/pitch

  • 4. Extract the content encoding from the 16 kHz audio:

    python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper

  • 5. Extract the timbre encoding from the 16 kHz audio:

    python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker

  • 6. Extract the linear spectrogram from the 32 kHz audio:

    python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs

  • 7. Generate the training index from the 32 kHz audio:

    python prepare/preprocess_train.py

  • 8. Debug the training files:

    python prepare/preprocess_zzz.py
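
As referenced above, here is a minimal sketch that chains steps 2 through 8; the commands are copied verbatim from the list, and PYTHONPATH is set for the child processes as step 1 requires:

    import os
    import subprocess

    # Step 1: the scripts expect PYTHONPATH to point at the repo root.
    env = {**os.environ, "PYTHONPATH": os.getcwd()}

    # Steps 2-8, copied verbatim from the list above.
    commands = [
        "python prepare/preprocess_a.py -w ./data_raw -o ./data_svc/waves-16k -s 16000",
        "python prepare/preprocess_a.py -w ./data_raw -o ./data_svc/waves-32k -s 32000",
        "python prepare/preprocess_f0.py -w data_svc/waves-16k/ -p data_svc/pitch",
        "python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper",
        "python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker",
        "python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs",
        "python prepare/preprocess_train.py",
        "python prepare/preprocess_zzz.py",
    ]
    for cmd in commands:
        print(f"+ {cmd}")
        subprocess.run(cmd, shell=True, check=True, env=env)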

Training

  • 1. Set the working directory 💗💗💗 (later steps will fail if this is not set):

    export PYTHONPATH=$PWD

  • 2. Start training, stage one:

    python svc_trainer.py -c configs/base.yaml -n sovits5.0

  • 3. Resume training (a checkpoint-finding helper sketch follows this list):

    python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/***.pth

  • 4. View the logs; complete training logs are available on the release page:

    tensorboard --logdir logs/

  • 5. Start training, stage two 💗

    Stage-two training adds PPG perturbation, GRL timbre removal, and the natural speech inference loss; still being validated.

    python svc_trainer.py -c configs/more.yaml -n more -e 1
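
The ***.pth in the resume command above stands for a concrete checkpoint name; a small helper like this (hypothetical, not part of the repo) can pick the most recently written checkpoint for you:

    from pathlib import Path

    def latest_checkpoint(ckpt_dir: str = "chkpt/sovits5.0") -> Path:
        """Return the most recently modified .pth file in the checkpoint directory."""
        ckpts = sorted(Path(ckpt_dir).glob("*.pth"), key=lambda p: p.stat().st_mtime)
        if not ckpts:
            raise FileNotFoundError(f"no .pth checkpoints in {ckpt_dir}")
        return ckpts[-1]

    print(latest_checkpoint())  # pass this path to svc_trainer.py via -p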

The stage-one training log at 20K steps is shown below; you can see it has not yet fully converged.

[Figure: sovits5.0 preview training log]

[Figure: sovits_spec]

Inference

  • 1. Set the working directory 💗💗💗 (later steps will fail if this is not set):

    export PYTHONPATH=$PWD

  • 2. Export the inference model: the text encoder, the Flow network, and the Decoder network; the discriminator and the posterior encoder are used only during training:

    python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt

  • 3. Use whisper to extract the content encoding; one-click inference is deliberately not used, in order to reduce VRAM usage:

    python whisper/inference.py -w test.wav -p test.ppg.npy

    This generates test.ppg.npy; if no ppg file is specified in the next step, the program will generate it automatically.

  • 4. Extract the F0 parameters in CSV text format, open the CSV file in Excel, and manually fix incorrect F0 values by cross-checking against Audition or SonicVisualiser:

    python pitch/inference.py -w test.wav -p test.csv

[Figure: Audition]

  • 5. Specify the parameters and run inference (a scripted end-to-end run follows this list):

    python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./configs/singers/singer0001.npy --wave test.wav --ppg test.ppg.npy --pit test.csv

    When --ppg is specified, repeated inference on the same audio skips re-extracting the content encoding; if it is not specified, it is extracted automatically.

    When --pit is specified, the hand-tuned F0 parameters are loaded; if it is not specified, they are extracted automatically.

    The output file is written to svc_out.wav in the current directory.

    | args | --config | --model | --spk | --wave | --ppg | --pit |
    | ---- | -------- | ------- | ----- | ------ | ----- | ----- |
    | name | config file | model file | timbre file | audio file | audio content | pitch content |
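
As referenced in step 5, this sketch chains steps 3 through 5 using the exact commands above; the file names test.wav, test.ppg.npy, test.csv and the sample speaker file come from the examples, and you can hand-edit test.csv between the pitch step and the final call for manual F0 tuning:

    import os
    import subprocess

    env = {**os.environ, "PYTHONPATH": os.getcwd()}

    steps = [
        # Step 3: extract the whisper content encoding.
        "python whisper/inference.py -w test.wav -p test.ppg.npy",
        # Step 4: extract F0 as CSV (hand-edit test.csv afterwards if desired).
        "python pitch/inference.py -w test.wav -p test.csv",
        # Step 5: run inference with the pre-extracted ppg and pitch.
        "python svc_inference.py --config configs/base.yaml --model sovits5.0.pth "
        "--spk ./configs/singers/singer0001.npy --wave test.wav --ppg test.ppg.npy --pit test.csv",
    ]
    for cmd in steps:
        print(f"+ {cmd}")
        subprocess.run(cmd, shell=True, check=True, env=env)
    # Output: svc_out.wav in the current directory.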

Datasets

| Name | URL |
| ---- | --- |
| KiSing | http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/ |
| PopCS | https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md |
| opencpop | https://wenet.org.cn/opencpop/download/ |
| Multi-Singer | https://github.com/Multi-Singer/Multi-Singer.github.io |
| M4Singer | https://github.com/M4Singer/M4Singer/blob/master/apply_form.md |
| CSD | https://zenodo.org/record/4785016#.YxqrTbaOMU4 |
| KSS | https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset |
| JVS MuSic | https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_music |
| PJS | https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus |
| JSUT Song | https://sites.google.com/site/shinnosuketakamichi/publication/jsut-song |
| MUSDB18 | https://sigsep.github.io/datasets/musdb.html#musdb18-compressed-stems |
| DSD100 | https://sigsep.github.io/datasets/dsd100.html |
| Aishell-3 | http://www.aishelltech.com/aishell_3 |
| VCTK | https://datashare.ed.ac.uk/handle/10283/2651 |

Code sources and references

https://github.com/facebookresearch/speech-resynthesis paper

https://github.com/jaywalnut310/vits paper

https://github.com/openai/whisper/ paper

https://github.com/NVIDIA/BigVGAN paper

https://github.com/mindslab-ai/univnet paper

https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf

https://github.com/brentspell/hifi-gan-bwe

https://github.com/mozilla/TTS

SNAC: Speaker-normalized Affine Coupling Layer in Flow-based Architecture for Zero-Shot Multi-Speaker Text-to-Speech

Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers

AdaSpeech: Adaptive Text to Speech for Custom Voice

Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis

Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings

Contributors


License: MIT License

