StyleTTS2 Fine-Tuning Guide

This repository provides a guide on how to prepare a dataset and execute fine-tuning using the StyleTTS2 process. https://github.com/yl4579/StyleTTS2

Changelog

5/15/2024: https://github.com/IIEleven11/SilenceRemover is a fork of @jerryliuoft's https://github.com/jerryliuoft/SilenceRemover. It gives a visual representation for locating, removing, and/or adding silence within media. The original repo is specific to video, so it outputs an mp4. I will modify this soon to allow either audio or video output, to better align with our use case. It removes a lot of the guesswork that I've been doing with energy and decibel detection. I had to share it right away because it immediately saved me a ton of time. Take a look at it and throw them a star. It's a lifesaver.
3/24/2024: Phonemizer is now capable of handling languages other than English. - Contributor: [@Scralius]
2/09/2024: Implemented a new buffer for subtitles. This helps with the segmentation process. See "srtsegmenter.py" for more details. Added "add_padding.py" to add a length of silence to both ends of every audio segment.
2/08/2024: Added a script that adds a "silence buffer" within an audio file. This allows a larger margin of error during segmentation. Edited srtsegmenter.py, specifically the "end_time" variable now has to wait 600ms before it can make a cut. This combined with the silence buffer can help combat early segmentation. It was highly effective once I tuned the parameters correctly.
1/26/2024: Updated the README for clarity and to specify separate Windows and Linux whisperx commands.
1/12/2024: Added the ability to work with multiple SRT and Audio files at one time for large datasets or blended voices. - Contributor: [@78Alpha]
12/6/23: I noticed segmentation from the whisperx .json was unacceptable. I created a segmentation script that uses the .srt file that the whisperx command generates. From what I can tell this is significantly more accurate. This could be dataset specific. Use the json segmenter if needed.
12/5/23: Fixed a missing "else" in the Segmentation script.
12/4/23: A working config_ft.yml file is available in the tools folder.
12/2/23: Rewrote Segmentation and Transcription scripts.

Compatibility

The scripts are compatible with WSL2 and Linux. Windows requires additional dependencies and might not be worth the effort.

Setup

Environment Setup

Install conda and activate environment with Python 3.10:
- conda create --name dataset python==3.10
- conda activate dataset

Install Pytorch

- pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 -U

Install whisperx/phonemize and segmentation packages

- pip install git+https://github.com/m-bain/whisperx.git
- pip install phonemizer pydub pysrt

Install the tqdm progress bar

- pip install tqdm

Data Preparation

Change directory to where you have unpacked StyleTTS2FineTune (you should see the makeDataset folder).
To create the base directories, run the segmenter script once. It will create the folders:
- python srtsegmenter.py
Add your WAV audio file(s) to the audio directory (remove special characters, brackets, and parentheses from the file names to prevent issues).
Optional: adding silence is not mandatory for the training process. You can run whisperx and segmentation without it. If you do want to add silence, silencebuffer.py in the tools folder goes over your audio file, finds the silent portions between sentences or breaks in speech, and adds a specific length of silence to them. In theory this can give a more accurate cut during segmentation. You MUST adjust the parameters within the script to fit your data. I left the values that worked for my dataset in the code; you can try them as defaults if you wish.
Run the following command to generate srt files for all files in the audio folder:
- Linux -
```
for i in ../audio/*.wav; do whisperx "$i" --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H; done
```
- Windows - in a PowerShell terminal, copy and paste the following after verifying the path to the audio folder:
Get-ChildItem -Path "\StyleTTS2FineTune\makeDataset\tools\audio" -Filter *.wav | ForEach-Object { whisperx $_.FullName --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H }

This will generate a whisperx .srt transcription for each audio file. Place the .srt file(s) into the srt folder.
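The advice above about removing special characters, brackets, and parentheses from audio file names can be automated before running whisperx. The following is a minimal sketch and not part of this repository: `safe_name` and `rename_audio` are hypothetical helpers, and the set of allowed characters is an assumption you may want to widen.

```python
import re
from pathlib import Path


def safe_name(filename: str) -> str:
    """Return a name whose stem holds only letters, digits, '_' and '-'."""
    path = Path(filename)
    # Replace every run of disallowed characters with a single underscore.
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", path.stem).strip("_")
    # Fall back to a placeholder stem if nothing usable is left.
    return (stem or "audio") + path.suffix.lower()


def rename_audio(folder: str, dry_run: bool = True) -> list[tuple[str, str]]:
    """List (old, new) name pairs for .wav files; rename only if dry_run is False."""
    changes = []
    for wav in sorted(Path(folder).glob("*.wav")):
        new_name = safe_name(wav.name)
        if new_name != wav.name:
            changes.append((wav.name, new_name))
            if not dry_run:
                wav.rename(wav.with_name(new_name))
    return changes
```

Calling `rename_audio("audio")` with the default `dry_run=True` only reports what would change. Two different source names can map to the same cleaned name, so review the reported pairs before renaming for real.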

Segmentation and Transcription

Navigate to the main directory (you should see the makeDataset folder).
Within srtsegmenter.py there are some variables to adjust: buffer_time, max_allowed_gap, and the desired duration range in the final if statement. You can try the defaults I have set; they worked for me. BUT there's a chance they will not work well for your dataset. The process I went through was to adjust buffer_time, then run srtsegmenter.py, then listen to the segments in order. If they overlap, cut mid-sentence, or have artifacts, go back and adjust buffer_time. Repeat until you get the desired results.
Run the segmentation script (python makeDataset/tools/srtsegmenter.py)
Run the add_padding.py script to add a duration of silence to the end of each audio clip.
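To illustrate what padding a clip with silence involves, here is a minimal sketch using only Python's standard `wave` module. It is not the repository's add_padding.py (which may work differently, for example via pydub); `silence_bytes`, `pad_wav`, and the 200 ms default are assumptions made for the example. It pads both ends, as the changelog entry for add_padding.py describes, and handles uncompressed PCM WAV files only.

```python
import wave


def silence_bytes(framerate: int, sampwidth: int, nchannels: int, pad_ms: int) -> bytes:
    """Build pad_ms of PCM silence for the given audio format."""
    pad_frames = framerate * pad_ms // 1000
    # 8-bit WAV samples are unsigned (silence is 0x80); wider samples are signed (silence is 0).
    silent_byte = b"\x80" if sampwidth == 1 else b"\x00"
    return silent_byte * (pad_frames * sampwidth * nchannels)


def pad_wav(src: str, dst: str, pad_ms: int = 200) -> None:
    """Copy a PCM WAV file, adding pad_ms of silence before and after the audio."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frames = reader.readframes(reader.getnframes())

    pad = silence_bytes(params.framerate, params.sampwidth, params.nchannels, pad_ms)

    with wave.open(dst, "wb") as writer:
        writer.setnchannels(params.nchannels)
        writer.setsampwidth(params.sampwidth)
        writer.setframerate(params.framerate)
        writer.writeframes(pad + frames + pad)
```

Write the padded copy to a new path first and listen to a few clips before replacing the originals.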

The above steps will generate a set of segmented audio files, a folder of bad audio the script rejected, and an output.txt file. I have it set to throw out segments under 1 second and over 11.6 seconds. You can adjust this range to suit your data.
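The duration filter described above can be sketched as follows. This is an illustration under stated assumptions and not the logic of srtsegmenter.py itself: `to_ms`, `parse_srt`, and `split_by_duration` are hypothetical helpers, `buffer_ms` only loosely mirrors the idea of buffer_time by extending each end time, and the 1 second and 11.6 second bounds are the ones quoted above.

```python
import re

TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_ms(stamp: str) -> int:
    """Convert an SRT timestamp such as '00:01:02,500' to milliseconds."""
    h, m, s, ms = map(int, TIME_RE.fullmatch(stamp.strip()).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def parse_srt(text: str) -> list[tuple[int, int, str]]:
    """Return (start_ms, end_ms, text) for every cue in an SRT document."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        # The timing line may or may not be preceded by an index line.
        for i, line in enumerate(lines):
            if "-->" in line:
                start, end = line.split("-->")
                cues.append((to_ms(start), to_ms(end), " ".join(lines[i + 1:]).strip()))
                break
    return cues


def split_by_duration(cues, min_ms=1000, max_ms=11600, buffer_ms=0):
    """Separate cues into (kept, rejected) by duration after extending each end time."""
    kept, rejected = [], []
    for start, end, text in cues:
        end += buffer_ms
        target = kept if min_ms <= end - start <= max_ms else rejected
        target.append((start, end, text))
    return kept, rejected
```

Raising `buffer_ms` lengthens every segment, which can rescue very short cues but also pushes long ones over the upper bound, so it is worth re-checking the rejected list after each change.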

Phonemization

Run the script (python makeDataset/tools/phonemized.py --language en-us). The --language argument refers to an espeak-ng voice, such as 'fr-fr' for French (default is en-us). Check the espeak-ng identifier for your language here.
This script will create the train_list.txt and val_list.txt files.
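For orientation, StyleTTS2 expects each line of its training and validation lists in the form `filename.wav|transcription|speaker`. The sketch below shows one way such lines could be built and split into the two lists. It is not the code of phonemized.py: `format_entry` and `split_lines` are hypothetical helpers, and the 10% validation fraction and speaker id 0 are assumptions.

```python
import random


def format_entry(wav_name: str, phonemes: str, speaker_id: int = 0) -> str:
    """Build one list line in the filename|transcription|speaker layout."""
    return f"{wav_name}|{phonemes}|{speaker_id}"


def split_lines(lines, val_fraction=0.1, seed=0):
    """Shuffle lines reproducibly and return (train, val)."""
    lines = [line.strip() for line in lines if line.strip()]
    random.Random(seed).shuffle(lines)
    # Keep at least one validation line whenever there is more than one entry.
    n_val = max(1, int(len(lines) * val_fraction)) if len(lines) > 1 else 0
    return lines[n_val:], lines[:n_val]
```

A fixed `seed` keeps the split stable between runs, which makes it easier to compare fine-tuning runs on the same validation clips.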

OOD_list.txt comes from the LibriTTS dataset. The following are some things to consider, taken from the notes at yl4579/StyleTTS2#81. There is a lot of good information there and I suggest looking it over.

The LibriTTS dataset has poor punctuation and a mismatch between spoken/unspoken pauses and the transcripts. This is a common oversight in many datasets.
It also lacks variety in punctuation. In the field, you may encounter texts with creative use of dashes, pauses, and combinations of quotes and punctuation. LibriTTS lacks those cases, but the model can learn them!
Additionally, LibriTTS has stray quotes in some texts, or begins a sentence with a quote. These things reduce quality a little (or a lot, sometimes). You will want to filter those out.
Creating your own OOD_list.txt is an option. I need to play around with it more; the only real requirements should be good punctuation and text the model has not seen. I'm not sure what the ideal size of this list should be, though.
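The quote problems mentioned above can be screened with a simple heuristic. This is a rough sketch, not a method taken from the linked discussion: `is_clean` and `filter_ood` are hypothetical helpers, the rules are assumptions, and the input is assumed to be one bare text per line (if your list has pipe-separated fields, apply the check to the text field only).

```python
QUOTES = "\"'“”‘’"


def is_clean(text: str) -> bool:
    """Reject texts that are empty, start with a quote, or have unbalanced double quotes."""
    text = text.strip()
    if not text or text[0] in QUOTES:
        return False
    # Straight double quotes must come in pairs.
    if text.count('"') % 2:
        return False
    # Curly double quotes must open and close equally often.
    if text.count("“") != text.count("”"):
        return False
    return True


def filter_ood(lines):
    """Keep only the lines that pass the heuristic check."""
    return [line.strip() for line in lines if is_clean(line)]
```

Single quotes are deliberately not balance-checked because apostrophes in contractions would trigger false rejections. Expect to review the output by hand.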

At this point, you should have everything you need to fine-tune.

Fine-Tuning with StyleTTS2

Clone the StyleTTS2 repository and navigate to its directory:
- git clone https://github.com/yl4579/StyleTTS2.git
Install the required packages:
- cd StyleTTS2
- pip install -r requirements.txt
- sudo apt-get install espeak-ng
Prepare the data and model:
- Clear the wavs folder in the Data directory and replace its contents with your segmented wav files.
- Replace the val_list and train_list files in the Data folder with yours. Keep the OOD_list.txt file.
- Adjust the parameters in the config_ft.yml file in the Configs folder according to your needs.
Download the StyleTTS2-LibriTTS model and place it in the Models/LibriTTS directory.
If the language of your dataset is not English, you will need to modify the PL-BERT model of StyleTTS2. If this is your case, refer to this repository (don't forget to check if your language is supported).
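Before launching training, it can help to confirm that the list files and the wavs folder you just replaced agree with each other. The check below is a sketch and not part of StyleTTS2 or this repository: `check_entries` and `check_list` are hypothetical helpers, and it assumes the `filename.wav|transcription|speaker` line layout.

```python
from pathlib import Path


def check_entries(lines, available):
    """Return a list of problems found in list lines, given the set of wav names on disk."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        fields = line.split("|")
        if len(fields) != 3:
            problems.append(f"line {number}: expected 3 fields, found {len(fields)}")
        elif fields[0] not in available:
            problems.append(f"line {number}: missing audio file {fields[0]}")
        elif not fields[1].strip():
            problems.append(f"line {number}: empty transcription")
    return problems


def check_list(list_path: str, wav_dir: str):
    """Check one list file against the .wav files present in wav_dir."""
    available = {p.name for p in Path(wav_dir).glob("*.wav")}
    with open(list_path, encoding="utf-8") as handle:
        return check_entries(handle, available)
```

For example, `check_list("Data/train_list.txt", "Data/wavs")` returns an empty list when every entry is well formed and points at an existing clip. The paths here assume the default StyleTTS2 layout, so adjust them to match your config_ft.yml.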

Run

Finally, you can start the fine-tuning process with the following command:

accelerate launch --mixed_precision=fp16 --num_processes=1 train_finetune_accelerate.py --config_path ./Configs/config_ft.yml
