SOFA (Singing-Oriented Forced Aligner) is a forced alignment tool designed specifically for singing voice.
It has the following advantages:
- Easy to install
Note: SOFA is still in beta and may contain many bugs; its effectiveness is not guaranteed. If you encounter any issues or have suggestions for improvement, please feel free to raise an issue.
- Use `git clone` to download the code from this repository
- Install conda
- Create a conda environment, requiring Python version `3.8`:
  `conda create -n SOFA python=3.8 -y`
  `conda activate SOFA`
- Go to the official PyTorch website to install torch
- (Optional, to improve wav file reading speed) Go to the official PyTorch website to install torchaudio
- Install the other Python libraries:
  `pip install -r requirements.txt`
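
After the setup steps above, a quick sanity check can confirm that the interpreter and the two packages named in the steps are visible from the active environment. This is a minimal sketch, not part of SOFA: the function name `check_environment` is hypothetical, and the check only looks for importable packages named `torch` and `torchaudio` without importing them.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 8)):
    """Report whether the basic requirements from the setup steps are met.

    Hypothetical helper, not part of SOFA. It only checks that the packages
    can be found; it does not import them or verify their versions.
    """
    return {
        # The setup steps ask for Python 3.8.
        "python_ok": sys.version_info >= min_python,
        # torch is required.
        "torch": importlib.util.find_spec("torch") is not None,
        # torchaudio is optional (faster wav reading).
        "torchaudio": importlib.util.find_spec("torchaudio") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

A missing `torchaudio` entry is not an error here, since the steps mark it as optional.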
- Download the model file. You can find the trained models, with the `.ckpt` file extension, in the releases of this repository.
- Move the model file to the `/ckpt` folder.
- Place the dictionary file in the `/dictionary` folder. The default dictionary is `opencpop-extension.txt`.
- Prepare the data for forced alignment and place it in a folder (by default, the `/segments` folder), with the following format:
  - segments
    - singer1
      - segment1.lab
      - segment1.wav
      - segment2.lab
      - segment2.wav
      - ...
    - singer2
      - segment1.lab
      - segment1.wav
      - ...

  Ensure that the `.wav` files and their corresponding `.lab` files are in the same folder.
- Command-line inference
  Use `python infer.py` to perform inference.

  Parameters that need to be specified:
  - `--ckpt`: (must be specified) The path to the model weights;
  - `--folder`: The folder where the data to be aligned is stored (default is `segments`);
  - `--dictionary`: The dictionary file (default is `dictionary/opencpop-extension.txt`);

  `python infer.py --ckpt checkpoint_name --folder segments_path --dictionary dictionary_path`
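
Before running inference, it can help to confirm that every `.wav` file has a `.lab` file beside it, and to assemble the command from the three parameters listed above. The following is a minimal sketch under stated assumptions: `find_unpaired_wavs` and `build_infer_command` are hypothetical helper names, not part of SOFA, and the defaults simply mirror the ones given above.

```python
from pathlib import Path


def find_unpaired_wavs(folder="segments"):
    """Return the .wav files under `folder` that have no .lab file beside them.

    Hypothetical helper: it only checks that a same-named .lab file exists
    in the same directory, not that its contents are valid.
    """
    root = Path(folder)
    return sorted(
        wav for wav in root.rglob("*.wav")
        if not wav.with_suffix(".lab").is_file()
    )


def build_infer_command(ckpt, folder="segments",
                        dictionary="dictionary/opencpop-extension.txt"):
    """Assemble the infer.py command line as an argument list.

    Only the three flags documented above are used; --ckpt has no default.
    """
    return [
        "python", "infer.py",
        "--ckpt", str(ckpt),
        "--folder", str(folder),
        "--dictionary", str(dictionary),
    ]


if __name__ == "__main__":
    missing = find_unpaired_wavs("segments")
    for wav in missing:
        print(f"no .lab file next to {wav}")
    if not missing:
        # "checkpoint_name" is a placeholder, as in the command above.
        print(" ".join(build_infer_command("checkpoint_name")))
```

The returned argument list can be passed to `subprocess.run(..., check=True)` if the command should be launched from Python rather than typed by hand.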
- Using a custom g2p instead of a dictionary
- Matching mode: activate it by specifying `-m` during inference. It finds the most probable contiguous segment within the given phoneme sequence, rather than requiring every phoneme to be used.
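
To make the idea of "the most probable contiguous segment" concrete, here is a toy illustration. It is not SOFA's matching algorithm: it assumes each phoneme already has a made-up score (positive meaning the audio supports it, negative meaning it does not) and picks the contiguous span with the highest total score. The function name `best_contiguous_span` and the scores are hypothetical.

```python
def best_contiguous_span(scores):
    """Return (start, end) of the non-empty contiguous span with the largest sum.

    `end` is exclusive, so the span is scores[start:end]. This is the classic
    maximum-subarray scan, used here only as a toy stand-in for matching.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best_sum = cur_sum = scores[0]
    best_start = cur_start = 0
    best_end = 1
    for i in range(1, len(scores)):
        if cur_sum < 0:
            # A negative running total can only hurt; restart the span here.
            cur_sum = scores[i]
            cur_start = i
        else:
            cur_sum += scores[i]
        if cur_sum > best_sum:
            best_sum = cur_sum
            best_start, best_end = cur_start, i + 1
    return best_start, best_end


if __name__ == "__main__":
    # Made-up phonemes and scores, for illustration only.
    phonemes = ["SP", "n", "i", "h", "ao", "SP"]
    scores = [-2.0, 1.5, 2.0, 0.5, 1.0, -3.0]
    start, end = best_contiguous_span(scores)
    print(phonemes[start:end])
```

In this toy setting the leading and trailing phonemes are dropped because including them lowers the total, which mirrors the behaviour described above of not having to use all the phonemes.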
- Follow the steps above for setting up the environment. It is recommended to install torchaudio for faster binarization speed;
- Place the training data in the `data` folder in the following format:
  - data
    - full_label
      - singer1
        - wavs
          - audio1.wav
          - audio2.wav
          - ...
        - transcriptions.csv
      - singer2
        - wavs
          - ...
        - transcriptions.csv
    - weak_label
      - singer3
        - wavs
          - ...
        - transcriptions.csv
      - singer4
        - wavs
          - ...
        - transcriptions.csv
    - no_label
      - audio1.wav
      - audio2.wav
      - ...

  Where:
  - `transcriptions.csv` only needs to have the correct relative path to the `wavs` folder;
  - The `transcriptions.csv` in `weak_label` does not need to have a `ph_dur` column;
- Modify `binarize_config.yaml` as needed, then execute `python binarize.py`;
- Download the pre-trained model you need from releases, modify `train_config.yaml` as needed, then execute `python train.py -p path_to_your_pretrained_model`;
- For training visualization: `tensorboard --logdir=ckpt/`.
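
The training data layout described above can be checked before running `python binarize.py`. The sketch below is a hypothetical helper, not part of SOFA. It assumes that each singer folder should contain a `wavs` folder next to its `transcriptions.csv`, and it infers from the `weak_label` note that `full_label` transcriptions are expected to have a `ph_dur` column. No other column names are assumed.

```python
import csv
from pathlib import Path


def check_training_data(data_dir="data"):
    """Return a list of problems found in the training data layout.

    Hypothetical helper. An empty list means the checks passed; it does not
    prove that the data is otherwise valid.
    """
    problems = []
    root = Path(data_dir)
    for label_kind in ("full_label", "weak_label"):
        kind_dir = root / label_kind
        if not kind_dir.is_dir():
            continue  # A missing label kind is not treated as an error here.
        for singer_dir in sorted(p for p in kind_dir.iterdir() if p.is_dir()):
            if not (singer_dir / "wavs").is_dir():
                problems.append(f"{singer_dir}: missing wavs folder")
            csv_path = singer_dir / "transcriptions.csv"
            if not csv_path.is_file():
                problems.append(f"{singer_dir}: missing transcriptions.csv")
                continue
            # Read only the header row to see which columns exist.
            with csv_path.open(newline="", encoding="utf-8") as f:
                header = next(csv.reader(f), [])
            # Assumption: only full_label data needs ph_dur.
            if label_kind == "full_label" and "ph_dur" not in header:
                problems.append(f"{csv_path}: no ph_dur column")
    return problems


if __name__ == "__main__":
    for problem in check_training_data("data"):
        print(problem)
```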