ORI-Muchim / BEGANSing

BEGANSing - Korean SVS + SVC + AudioSR

BEGANSing + RVC + AudioSuperResolution

Korean Singing Voice Synthesis + Singing Voice Conversion (SVS + SVC)

The system generates singing voice from a given text and MIDI in an end-to-end manner.

[Figure: Overview of the proposed system]

Contents

  • Installation
  • Prepare Dataset
  • Preprocessing & Training
  • Usage
  • Results
  • To-Do

Installation

  • A Windows/Linux system with a minimum of 16GB RAM.
  • A GPU with at least 12GB of VRAM.
  • Python >= 3.8
  • Anaconda installed.
  • PyTorch installed.
  • CUDA 11.8 installed.

PyTorch install command:

pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118

CUDA 11.8 install:

https://developer.nvidia.com/cuda-11-8-0-download-archive

  1. Create an Anaconda environment:
conda create -n begansing python=3.9
  2. Activate the environment:
conda activate begansing
  3. Clone this repository to your local machine:
git clone https://github.com/ORI-Muchim/BEGANSing.git
  4. Navigate to the cloned directory:
cd BEGANSing
  5. Install the necessary dependencies:
pip install -r requirements.txt
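
After the dependencies are installed, you can optionally confirm that PyTorch was built with CUDA support and can see the GPU before moving on. This is a minimal sanity check, not part of the repository:

# Minimal environment check (not part of the repository)
import torch

print("PyTorch version:", torch.__version__)         # expected: 2.0.1+cu118
print("CUDA available:", torch.cuda.is_available())  # should print True
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("GPU:", props.name, f"({props.total_memory / 1024**3:.1f} GB VRAM)")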

Prepare Dataset

Inside the cloned folder there is a ./test_datasets folder. Place the MIDI file and the text file in it according to the expected format; every MIDI file must always have a matching text file. As an example, GFRIEND's "Rough" MIDI and text are provided. To convert the voice of the generated vocals, create a folder named after the speaker inside the ./datasets folder and put that speaker's voice data for Retrieval-based Voice Conversion (RVC) in it. The ./datasets layout looks like this:

BEGANSing
├────datasets
│       ├───kss
│       │   ├────1_0000.wav
│       │   ├────1_0001.wav
│       │   └────...
│       └───{speaker_name}
│            ├───1.wav
│            └───2.wav

This is just an example, and it's okay to add more speakers.
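
Because every MIDI file needs a matching text file and each RVC speaker folder needs wav data, a short script can check the layout before running the pipeline. This is an illustrative sketch only, assuming the .mid/.txt/.wav extensions used above; it is not part of the repository:

# check_datasets.py - illustrative layout check, not part of the repository
from pathlib import Path

test_dir = Path("test_datasets")
midis = sorted(test_dir.glob("*.mid")) + sorted(test_dir.glob("*.midi"))
texts = sorted(test_dir.glob("*.txt"))
print(f"{len(midis)} MIDI file(s) and {len(texts)} text file(s) in {test_dir}")
if len(midis) != len(texts):
    print("Warning: the number of MIDI and text files must match.")

for speaker in sorted(Path("datasets").iterdir()):
    if speaker.is_dir():
        wavs = list(speaker.glob("*.wav"))
        print(f"Speaker '{speaker.name}': {len(wavs)} wav file(s) for RVC")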

Preprocessing & Training

The pre-trained model provided here was trained for an additional 100 epochs. For preprocessing and training, see the Preprocessing and Training sections in the original repository.

Usage

python main.py {speaker_name} {song} {pitch_shift} --audiosr

If the speaker is male, it is recommended to set {pitch_shift} to -12; if the speaker is female, set it to 0.
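
Assuming {pitch_shift} is a transpose value in semitones (as in RVC's transpose setting), -12 shifts the generated melody down exactly one octave, which brings it into a typical male register. The snippet below only illustrates the equal-temperament frequency ratio behind that choice; it is not taken from the repository code:

# Semitone-to-frequency illustration (not repository code)
def shift_frequency(freq_hz: float, semitones: int) -> float:
    """Equal temperament: each semitone scales frequency by 2**(1/12)."""
    return freq_hz * 2 ** (semitones / 12)

print(shift_frequency(440.0, -12))  # 220.0 Hz: A4 shifted one octave down to A3
print(shift_frequency(440.0, 0))    # 440.0 Hz: unchanged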

The --audiosr option up-samples the generated voice from 22050 Hz to 48000 Hz. Use it if you have a capable graphics card or don't mind the longer generation time; otherwise omit it.
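
For comparison, plain resampling also converts 22050 Hz audio to 48000 Hz but cannot restore the high-frequency content that AudioSR reconstructs. Below is a minimal torchaudio sketch of that naive baseline, for illustration only; the file names are hypothetical and this is not how the --audiosr path works internally:

# Naive resampling baseline for comparison with AudioSR (illustrative only)
import torchaudio

waveform, sr = torchaudio.load("result_22050.wav")  # hypothetical output file name
resampled = torchaudio.transforms.Resample(orig_freq=sr, new_freq=48000)(waveform)
torchaudio.save("result_48000_naive.wav", resampled, 48000)
# This only interpolates; content above the original ~11 kHz Nyquist limit is not recovered.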

Results

Audio samples are available at https://soonbeomchoi.github.io/saebyulgan-blog/. The model was trained on an RTX 3090 (24GB) with batch size 32 for 2 days.

[Figure: BEGANSing TensorBoard]

To-Do

  • Change vocoder: Griffin-Lim -> HiFi-GAN
