SoftVC VITS Singing Voice Conversion

Terms of Use

This project is intended solely for academic exchange and learning. It is not intended for production environments. You are responsible for obtaining authorization for any dataset you use, and you bear sole responsibility for any problems caused by training on unauthorized datasets and for all consequences thereof.
Any video based on sovits that is published on a video platform must state clearly in its description that voice conversion was used and must identify the input source of the voice or audio. For example, if the input was obtained by separating the vocals from a video or audio track published by someone else, a clear link to the original video or music must be provided. If the input is your own voice, or a voice synthesized by other commercial vocal synthesis software, this must also be stated in the description.
You shall be solely responsible for any infringement problems caused by the input source. When using other commercial vocal synthesis software as the input source, please ensure that you comply with that software's terms of use. Note that the terms of use of many vocal synthesis engines explicitly prohibit using their output as an input source for conversion.
Continuing to use this project is deemed agreement to the provisions stated in this repository's README. This README has fulfilled its duty to advise, and it is not responsible for any subsequent problems that may arise.
If you distribute this repository's code or publish any results produced by this project publicly (including but not limited to video sharing platforms), please indicate the original author and code source (this repository).
If you plan to use this project for any other purpose, please contact and inform the author of this repository in advance. Thank you very much.

Model Introduction

The singing voice conversion model uses the SoftVC content encoder to extract speech features from the source audio and feeds them to the model together with F0, in place of the original text input, to achieve singing voice conversion. In addition, the vocoder is replaced with NSF HiFiGAN to solve the problem of sound interruption.

4.0 v2 update content

The model architecture is completely changed to VISinger2.
Everything else is exactly the same as 4.0.

4.0 v2 features

It is better than 4.0 in some scenarios (for example, it reduces the electrical-noise artifact in breath sounds).
It also regresses in some scenarios. For example, training on data from VTuber streams gives worse results than 4.0, and in some cases the output sounds terrible.
4.0-v2 is the last version of sovits. There will be no further updates.

Note

The 4.0-v2 workflow, including preprocessing and requirements, is almost identical to that of 4.0.
The differences from 4.0 are:
- The models are completely different. Check the version of the pretrained models if you are using them.
- The structure of the config file has changed substantially. If you are reusing a dataset preprocessed for 4.0, you only need to run python preprocess_flist_config.py to generate a new config.json.

Pre-trained Model Files

Required

ContentVec: checkpoint_best_legacy_500.pt
- Place it under the hubert directory

# contentvec
wget -P hubert/ http://obs.cstcloud.cn/share/obs/sankagenkeshi/checkpoint_best_legacy_500.pt
# Alternatively, you can manually download and place it in the hubert directory

Optional (strongly recommended)

# G and D pre-training model:
wget -P logs/44k/ https://huggingface.co/justinjohn-03/so-vits-svc-4.0-v2-pretrained/resolve/main/G_0.pth
wget -P logs/44k/ https://huggingface.co/justinjohn-03/so-vits-svc-4.0-v2-pretrained/resolve/main/D_0.pth
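
Before moving on, it can help to confirm that the downloads landed where the commands above place them. The following is a minimal sketch, not part of the repository: the helper name missing_files is hypothetical, and the paths are taken directly from the wget commands above.

```python
from pathlib import Path

# Paths follow the download commands above; the required/optional split follows this README.
REQUIRED = [Path("hubert/checkpoint_best_legacy_500.pt")]
OPTIONAL = [Path("logs/44k/G_0.pth"), Path("logs/44k/D_0.pth")]


def missing_files(root, candidates):
    """Return the candidate paths that do not exist as files under root."""
    root = Path(root)
    return [p for p in candidates if not (root / p).is_file()]


if __name__ == "__main__":
    # Run from the repository root.
    for label, group in (("required", REQUIRED), ("optional", OPTIONAL)):
        for path in missing_files(".", group):
            print(f"missing {label} file: {path}")
```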

Dataset Preparation

Simply place the dataset in the dataset_raw directory with the following file structure.

dataset_raw
├───speaker0
│   ├───xxx1-xxx1.wav
│   ├───...
│   └───Lxx-0xx8.wav
└───speaker1
    ├───xx2-0xxx2.wav
    ├───...
    └───xxx7-xxx007.wav
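
As a quick sanity check of the layout above, the sketch below counts the .wav files found directly inside each speaker folder. The function name count_wavs_per_speaker is hypothetical and is not provided by this repository; it only assumes the one-folder-per-speaker structure shown above.

```python
from pathlib import Path


def count_wavs_per_speaker(dataset_root="dataset_raw"):
    """Map each speaker sub-directory name to the number of .wav files directly inside it."""
    root = Path(dataset_root)
    counts = {}
    for speaker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[speaker_dir.name] = sum(
            1
            for f in speaker_dir.iterdir()
            if f.is_file() and f.suffix.lower() == ".wav"
        )
    return counts
```

A speaker that maps to 0 usually means the audio was placed one level too deep or uses a different file extension.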

Preprocessing

Resample to 44100 Hz

python resample.py
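
To confirm that a folder of audio really is at the target rate, a check along the following lines may be useful. This is a sketch that uses only the Python standard library; the function name wrong_rate_files is hypothetical, and the folder to inspect is passed in by the caller because the output location of resample.py is not stated here. The wave module reads PCM WAV files only, so other encodings would raise wave.Error.

```python
import wave
from pathlib import Path

TARGET_RATE = 44100  # Hz, the rate named in the step above


def wrong_rate_files(folder, target_rate=TARGET_RATE):
    """Return (path, rate) pairs for .wav files under folder whose rate differs from target_rate."""
    mismatched = []
    for wav_path in sorted(Path(folder).rglob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav_file:
            rate = wav_file.getframerate()
        if rate != target_rate:
            mismatched.append((wav_path, rate))
    return mismatched
```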

Automatically split the dataset into training, validation, and test sets, and generate configuration files

python preprocess_flist_config.py
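
The idea behind the split can be illustrated with the sketch below. It is not the logic of preprocess_flist_config.py: the function name split_file_list, the hold-out sizes, and the seed are all assumptions chosen for illustration.

```python
import random


def split_file_list(paths, n_val=2, n_test=2, seed=1234):
    """Shuffle paths reproducibly, then hold out n_val and n_test items.

    The hold-out sizes and the seed are illustrative assumptions, not the
    values used by preprocess_flist_config.py.
    """
    if len(paths) <= n_val + n_test:
        raise ValueError("not enough files to leave any training data")
    shuffled = sorted(paths)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test
```

Each file ends up in exactly one of the three lists, and the same seed gives the same split on every run.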

Generate hubert and f0

python preprocess_hubert_f0.py

After completing the above steps, the dataset directory will contain the preprocessed data, and the dataset_raw folder can be deleted.

Training

python train.py -c configs/config.json -m 44k

Note: During training, old checkpoints are deleted automatically and only the latest three are kept. To guard against overfitting, either back up checkpoints manually or set keep_ckpts to 0 in the configuration file so that checkpoints are never deleted.
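
The retention rule in the note can be pictured with the sketch below. It is an illustration of the described behavior, not the cleanup code used by train.py: the function name checkpoints_to_delete is hypothetical, the G_<step>.pth and D_<step>.pth naming is inferred from the file names in this README, and the sketch gives no special treatment to the pretrained G_0.pth and D_0.pth files.

```python
import re

# Matches names such as G_30400.pth or D_0.pth and captures the step number.
_STEP = re.compile(r"^[GD]_(\d+)\.pth$")


def checkpoints_to_delete(filenames, keep_ckpts=3):
    """Return G_/D_ checkpoint names that fall outside the newest keep_ckpts steps.

    keep_ckpts == 0 means nothing is deleted, as described in the note above.
    """
    if keep_ckpts == 0:
        return []
    by_step = {}
    for name in filenames:
        match = _STEP.match(name)
        if match:
            by_step.setdefault(int(match.group(1)), []).append(name)
    old_steps = sorted(by_step)[:-keep_ckpts]
    return sorted(name for step in old_steps for name in by_step[step])
```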

Inference

Use inference_main.py

Up to this point, the usage of version 4.0 (training and inference) is exactly the same as version 3.0, with no changes (inference now has command line support).

# Example
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "君の知らない物語-src.wav" -t 0 -s "nen"

Required parameters:

-m, --model_path: path to the model.
-c, --config_path: path to the configuration file.
-n, --clean_names: a list of wav file names located in the raw folder.
-t, --trans: pitch adjustment, supports positive and negative (semitone) values.
-s, --spk_list: target speaker name for synthesis.
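
For a sense of what a semitone value passed to -t means: in equal temperament, shifting by n semitones multiplies a frequency by 2^(n/12). The sketch below applies that factor to a list of F0 values. It is a worked illustration only; the function name shift_f0 is hypothetical, and whether inference_main.py applies the shift in exactly this form is an assumption.

```python
def shift_f0(f0_hz, semitones):
    """Scale a list of F0 values (Hz) by the given number of semitones.

    Frames stored as 0 Hz (commonly used for unvoiced frames) stay at 0.
    """
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor for f in f0_hz]


# One octave up (+12) doubles the frequency; one octave down (-12) halves it.
print(shift_f0([440.0, 0.0], 12))   # [880.0, 0.0]
print(shift_f0([440.0], -12))       # [220.0]
```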

Optional parameters (see the next section for details):

-a, --auto_predict_f0: automatic pitch prediction for voice conversion. Do not enable this when converting songs, as it can cause serious pitch issues.
-cm, --cluster_model_path: path to the clustering model, fill in any value if clustering is not trained.
-cr, --cluster_infer_ratio: proportion of the clustering solution, range 0-1, fill in 0 if the clustering model is not trained.
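
When inference is launched from another script, the flags listed above can be assembled programmatically. The sketch below is a hypothetical helper, not part of the repository. It assumes that -a is a boolean switch and that -n and -s each receive a single value, as in the example command above; check inference_main.py before relying on either assumption.

```python
def build_inference_command(model_path, config_path, clean_name, trans, speaker,
                            auto_predict_f0=False, cluster_model_path=None,
                            cluster_infer_ratio=None):
    """Build an argument list for inference_main.py from the documented flags."""
    cmd = [
        "python", "inference_main.py",
        "-m", model_path,
        "-c", config_path,
        "-n", clean_name,
        "-t", str(trans),
        "-s", speaker,
    ]
    if auto_predict_f0:
        cmd.append("-a")  # assumption: -a takes no value
    if cluster_model_path is not None:
        cmd += ["-cm", cluster_model_path]
    if cluster_infer_ratio is not None:
        if not 0.0 <= cluster_infer_ratio <= 1.0:
            raise ValueError("cluster_infer_ratio must be in the range 0-1")
        cmd += ["-cr", str(cluster_infer_ratio)]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root.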

Optional Settings

If the results from the previous section are satisfactory, or if the following section is unclear to you, you can skip it without affecting model usage. These optional settings have a relatively small impact: they may help with certain specific data, but in most cases the difference is not noticeable.

Automatic f0 prediction

During 4.0 model training, an f0 predictor is also trained, which can be used for automatic pitch prediction during voice conversion. If the result is poor, the pitch can be adjusted manually instead. Do not enable this feature when converting singing voice, as it may cause serious pitch shifting.

Set "auto_predict_f0" to true in inference_main.

Cluster-based timbre leakage control

Introduction: The clustering scheme can reduce timbre leakage and make the trained model sound more like the target's timbre (although this effect is not very obvious), but using clustering alone will lower the model's clarity (the model may sound unclear). Therefore, this model adopts a fusion method to linearly control the proportion of clustering and non-clustering schemes. In other words, you can manually adjust the ratio between "sounding like the target's timbre" and "being clear and articulate" to find a suitable trade-off point.

The existing steps before clustering do not need to be changed. All you need to do is to train an additional clustering model, which has a relatively low training cost.

Training process:
- Train on a machine with good CPU performance. In my experience, it takes about 4 minutes to train each speaker on a Tencent Cloud 6-core CPU.
- Execute "python cluster/train_cluster.py". The output of the model will be saved in "logs/44k/kmeans_10000.pt".
Inference process:
- Specify "cluster_model_path" in inference_main.
- Specify "cluster_infer_ratio" in inference_main, where 0 means not using clustering at all, 1 means only using clustering, and usually 0.5 is sufficient.
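
The linear fusion described above can be sketched as follows for a single feature frame. This is an illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, plain Python lists stand in for feature tensors, and the cluster lookup is assumed to be a nearest-center search, as the kmeans file name suggests.

```python
def nearest_center(frame, centers):
    """Return the cluster center closest to frame (squared Euclidean distance)."""
    return min(centers, key=lambda c: sum((a - b) ** 2 for a, b in zip(frame, c)))


def blend_with_cluster(frame, centers, cluster_infer_ratio):
    """Linearly mix a content feature frame with its nearest cluster center.

    A ratio of 0 returns the original frame and a ratio of 1 returns the center.
    """
    if not 0.0 <= cluster_infer_ratio <= 1.0:
        raise ValueError("cluster_infer_ratio must be in the range 0-1")
    center = nearest_center(frame, centers)
    r = cluster_infer_ratio
    return [r * c + (1.0 - r) * f for f, c in zip(frame, center)]
```

Raising the ratio pulls each frame toward the target speaker's cluster centers, which is the trade-off between timbre similarity and clarity that the ratio controls.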

sovits4v2 for colab.ipynb

Exporting to Onnx

Use onnx_export.py

Create a folder named checkpoints and open it.
Create a folder in the checkpoints folder as your project folder, naming it after your project, for example aziplayer.
Rename your model as model.pth, the configuration file as config.json, and place them in the aziplayer folder you just created.
In onnx_export.py, change path = "NyaruTaffy" to your project name, for example path = "aziplayer".
Run onnx_export.py.
Wait for it to finish running. A model.onnx will be generated in your project folder, which is the exported model.
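
The folder preparation in the steps above can be scripted. The sketch below is a hypothetical helper, not part of the repository. It copies the files rather than renaming them, so the original checkpoint stays in place. Editing the path variable in onnx_export.py and running the script remain manual steps.

```python
import shutil
from pathlib import Path


def prepare_export_folder(model_file, config_file, project_name,
                          checkpoints_root="checkpoints"):
    """Create checkpoints/<project_name>/ containing model.pth and config.json."""
    project_dir = Path(checkpoints_root) / project_name
    project_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(model_file, project_dir / "model.pth")
    shutil.copy2(config_file, project_dir / "config.json")
    return project_dir
```

For example, prepare_export_folder("logs/44k/G_30400.pth", "configs/config.json", "aziplayer") would produce the checkpoints/aziplayer folder used in the steps above.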

UI support for Onnx models

MoeSS

Note: For Hubert Onnx models, please use the models provided by MoeSS. Currently, they cannot be exported on your own: Hubert in fairseq uses many operators that are not supported by Onnx, and its handling of constants can cause export errors or lead to wrong input/output shapes and results. Hubert4.0

Some legal provisions for reference

Civil Code

Article 1019

No organization or individual may infringe upon the portrait right of another person by vilifying or defacing the portrait, or by falsifying it by means of information technology. Without the consent of the portrait right holder, the portrait may not be produced, used, or made public, except as otherwise provided by law. Without the consent of the portrait right holder, the holder of the rights in a portrait work may not use or disclose the portrait by publication, reproduction, distribution, rental, exhibition, or similar means. The protection of a natural person's voice is governed by reference to the relevant provisions on the protection of portrait rights.

Article 1024

[The right to reputation] Civil subjects enjoy the right to reputation. No organization or individual may infringe upon another person's right to reputation by insult, defamation, or similar means.

Article 1027

If a literary or artistic work published by the perpetrator depicts a real or specific person and contains insulting or defamatory content that infringes upon that person's right to reputation, the victim has the right to request that the perpetrator bear civil liability. If the work does not depict a specific person and only some of its circumstances resemble those of a specific person, the perpetrator shall not bear civil liability.

thecooltechguy / so-vits-svc-4.0-v2