rupakvignesh / Lyrics-to-Audio-Alignment

Aligns text (lyrics) with monophonic singing voice (audio). The algorithm first segments the audio into structural sections and then uses hidden Markov models to obtain an alignment within each segment. The final alignment is the concatenation of the lyric time stamps across the segments of each song.

Lyrics-to-Audio-Alignment

This project aims to create an automatic alignment between textual lyrics and monophonic singing vocals (audio). Such a system is useful in a karaoke setting, where a performer wants to stay in sync with the backing track. Traditional hidden Markov models are used for phoneme modelling, and a structural segmentation approach breaks the audio (typically 4-5 minutes long) into smaller, structurally meaningful chunks (intro, verse, chorus, etc.) without any implicit assumptions about the song's form.
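The concatenation step is simple enough to sketch. The snippet below is illustrative only, not code from this repository; the helper name and the (phoneme, start, end) tuple format are assumptions. Each segment's alignment is relative to the segment start, so it is offset back into song time before concatenation.

def concatenate_alignments(segments):
    """Merge per-segment alignments into one song-level alignment.

    segments: list of (segment_start_sec, [(phoneme, start, end), ...]),
    where start/end are in seconds relative to the segment start.
    """
    song_alignment = []
    for seg_start, phone_times in segments:
        for phoneme, start, end in phone_times:
            # Offset segment-relative times back into song time.
            song_alignment.append((phoneme, seg_start + start, seg_start + end))
    return song_alignment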

Watch the Demo

Video link

Pre-requisites

The scripts assume a working HTK installation (the training and alignment steps use HTK-style prototype files, MLFs, and forced Viterbi alignment), the MSAF Python library for structural segmentation, Python, and a tcsh shell.

Training Steps

Training Acoustic Models

TIMIT

  • Create initial HMM models (isolated phoneme training); a sketch of prototype generation follows this list
tcsh scripts/model_gen.sh <phonelist> <proto_file>
  • Create connected HMM models (embedded re-estimation)
tcsh scripts/embedded_reestimation.sh <iterations>
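model_gen.sh is not reproduced on this page. Judging from the proto file, phone list, and MLF terminology, the pipeline follows HTK conventions, so a flat-start prototype generator might look roughly like the Python sketch below; the 5-state topology, 39-dimensional MFCC_0_D_A features, and all paths are assumptions.

# Hedged sketch of HTK-style prototype generation: one 5-state (3 emitting)
# left-to-right HMM per phoneme over 39-dim MFCC_0_D_A features.
# The actual model_gen.sh may differ; dimensions and paths are assumptions.

VEC_SIZE = 39  # 13 MFCCs + deltas + accelerations (assumed)

def proto_text(phone, n_states=5):
    """Render one HTK HMM definition with zero means and unit variances."""
    lines = [f'~h "{phone}"', "<BeginHMM>", f"  <NumStates> {n_states}"]
    for s in range(2, n_states):  # states 2..n-1 are the emitting states
        lines += [
            f"  <State> {s}",
            f"    <Mean> {VEC_SIZE}",
            "      " + " ".join(["0.0"] * VEC_SIZE),
            f"    <Variance> {VEC_SIZE}",
            "      " + " ".join(["1.0"] * VEC_SIZE),
        ]
    lines.append(f"  <TransP> {n_states}")
    for i in range(n_states):  # left-to-right transitions with self-loops
        row = ["0.0"] * n_states
        if i == 0:
            row[1] = "1.0"
        elif i < n_states - 1:
            row[i], row[i + 1] = "0.6", "0.4"
        lines.append("    " + " ".join(row))
    lines.append("<EndHMM>")
    return "\n".join(lines)

with open("phonelist") as f:            # one phoneme per line (assumed)
    phones = [p.strip() for p in f if p.strip()]
with open("hmm0/hmmdefs", "w") as out:  # output location is an assumption
    out.write(f"~o <VecSize> {VEC_SIZE} <MFCC_0_D_A>\n")
    for phone in phones:
        out.write(proto_text(phone) + "\n")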

DAMP

  • Align the DAMP dataset with the generated HMM models using forced Viterbi alignment, as sketched below
  • Perform embedded re-estimation on the DAMP dataset to refine the phoneme models.
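The repository's alignment scripts are not shown on this page; with HTK, a forced Viterbi alignment is typically run through HVite, roughly as below. Every file name here is a placeholder.

# Hedged sketch: forced Viterbi alignment with HTK's HVite, which the
# alignment scripts presumably wrap. All paths are placeholders.
import subprocess

subprocess.run([
    "HVite",
    "-a",                      # align mode: build the network from the MLF
    "-m",                      # include model (phoneme) boundaries in output
    "-H", "hmm/hmmdefs",       # trained HMM definitions
    "-I", "damp_words.mlf",    # word-level transcriptions to force-align
    "-i", "damp_aligned.mlf",  # output MLF with phoneme time stamps
    "-S", "damp_train.scp",    # list of DAMP feature files
    "dict", "phonelist",       # pronunciation dictionary and phone list
], check=True)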

Structural Segmentation

  • Use the MSAF library to segment the DAMP training data into structural segments (see the sketch after these steps)
python scripts/msaf_segmentation.py <wav_in_dir> <wav_out_dir>
  • Create MLF files corresponding to the segmented audio
python scripts/msaf_to_mlf.py <labfile_list>
  • Perform embedded re-estimation within these segments to get the final phoneme models
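The two segmentation scripts are not reproduced here. Using MSAF's documented process() API and HTK's MLF conventions (times in 100 ns units), the core of both steps might look roughly like this; the algorithm choices and file names are assumptions.

# Hedged sketch: segment one song with MSAF and write the boundaries as an
# HTK master label file (MLF). The boundary/label algorithms and file names
# are assumptions; the repo's scripts may choose differently.
import msaf

# process() returns segment boundaries (seconds) and one label per segment.
boundaries, labels = msaf.process("song.wav",
                                  boundaries_id="sf", labels_id="fmc2d")

HTK_UNITS_PER_SEC = 10_000_000  # HTK label times are in 100 ns units

with open("song_segments.mlf", "w") as out:
    out.write("#!MLF!#\n")
    out.write('"*/song.lab"\n')
    for start, end, label in zip(boundaries[:-1], boundaries[1:], labels):
        out.write(f"{int(start * HTK_UNITS_PER_SEC)} "
                  f"{int(end * HTK_UNITS_PER_SEC)} seg{int(label)}\n")
    out.write(".\n")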

Testing

  • To test any model, first run the forced Viterbi alignment
sh scripts/force_align.sh

Set parameters such as the model, features, MLF, and dictionary inside the script.

  • To evaluate the model's performance, compute overlap against the manually annotated ground truth
python scripts/lab_to_lrc.py <lyrics_list>

Set the ground-truth and output folders inside the script. A sketch of the overlap computation follows.
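The exact metric lives in the evaluation script; a minimal version of an overlap measure, assuming (word, start, end) intervals in matching order, could be:

# Minimal sketch of an overlap metric: the fraction of ground-truth duration
# covered by the predicted interval for the same word. The (word, start, end)
# tuple format is an assumption about the .lab/.lrc contents.

def overlap_ratio(predicted, groundtruth):
    """Both arguments: lists of (word, start_sec, end_sec) in the same order."""
    covered = total = 0.0
    for (_, p_start, p_end), (_, g_start, g_end) in zip(predicted, groundtruth):
        covered += max(0.0, min(p_end, g_end) - max(p_start, g_start))
        total += g_end - g_start
    return covered / total if total else 0.0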

Authors

Acknowledgments

  • Thanks to Alex Lerch for his guidance
  • S Aswin Shanmugham's hybrid segmentation framework
  • Stanford's DAMP dataset.
