rupakvignesh / Lyrics-to-Audio-Alignment

Aligns text (lyrics) with monophonic singing voice (audio). The algorithm first segments the audio into structural sections and then uses hidden Markov models to obtain an alignment within each segment. The final alignment is the concatenation of the lyric time stamps across the segments of each song.

Lyrics-to-Audio-Alignment

This project aims to create an automatic alignment between textual lyrics and monophonic singing vocals (audio). Such a system is useful in a karaoke setting, where a performer wants to stay in sync with the backing track. Traditional hidden Markov models are used for phoneme modelling, and a structural segmentation approach breaks the audio (typically 4-5 minutes long) into smaller, structurally meaningful chunks (intro, verse, chorus, etc.) without any implicit assumptions about the song's form.
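The concatenation step is simple enough to sketch. The snippet below is illustrative only, not code from this repository; the helper name and the (phoneme, start, end) tuple format are assumptions. Each segment's alignment is relative to the segment start, so it is offset back into song time before concatenation.

def concatenate_alignments(segments):
    """Merge per-segment alignments into one song-level alignment.

    segments: list of (segment_start_sec, [(phoneme, start, end), ...]),
    where start/end are in seconds relative to the segment start.
    """
    song_alignment = []
    for seg_start, phone_times in segments:
        for phoneme, start, end in phone_times:
            # Offset segment-relative times back into song time.
            song_alignment.append((phoneme, seg_start + start, seg_start + end))
    return song_alignment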

Watch the Demo

Video link

Pre-requisites

The scripts assume a working HTK installation (the training and alignment steps use HTK-style prototype files, MLFs, and forced Viterbi alignment), the MSAF Python library for structural segmentation, Python, and a tcsh shell.

Training Steps

Training Acoustic Models

TIMIT

  • Create initial HMM models (isolated phoneme training); a sketch of prototype generation follows this list
tcsh scripts/model_gen.sh <phonelist> <proto_file>
  • Create connected HMM models (embedded re-estimation)
tcsh scripts/embedded_reestimation.sh <iterations>
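model_gen.sh is not reproduced on this page. Judging from the proto file, phone list, and MLF terminology, the pipeline follows HTK conventions, so a flat-start prototype generator might look roughly like the Python sketch below; the 5-state topology, 39-dimensional MFCC_0_D_A features, and all paths are assumptions.

# Hedged sketch of HTK-style prototype generation: one 5-state (3 emitting)
# left-to-right HMM per phoneme over 39-dim MFCC_0_D_A features.
# The actual model_gen.sh may differ; dimensions and paths are assumptions.

VEC_SIZE = 39  # 13 MFCCs + deltas + accelerations (assumed)

def proto_text(phone, n_states=5):
    """Render one HTK HMM definition with zero means and unit variances."""
    lines = [f'~h "{phone}"', "<BeginHMM>", f"  <NumStates> {n_states}"]
    for s in range(2, n_states):  # states 2..n-1 are the emitting states
        lines += [
            f"  <State> {s}",
            f"    <Mean> {VEC_SIZE}",
            "      " + " ".join(["0.0"] * VEC_SIZE),
            f"    <Variance> {VEC_SIZE}",
            "      " + " ".join(["1.0"] * VEC_SIZE),
        ]
    lines.append(f"  <TransP> {n_states}")
    for i in range(n_states):  # left-to-right transitions with self-loops
        row = ["0.0"] * n_states
        if i == 0:
            row[1] = "1.0"
        elif i < n_states - 1:
            row[i], row[i + 1] = "0.6", "0.4"
        lines.append("    " + " ".join(row))
    lines.append("<EndHMM>")
    return "\n".join(lines)

with open("phonelist") as f:            # one phoneme per line (assumed)
    phones = [p.strip() for p in f if p.strip()]
with open("hmm0/hmmdefs", "w") as out:  # output location is an assumption
    out.write(f"~o <VecSize> {VEC_SIZE} <MFCC_0_D_A>\n")
    for phone in phones:
        out.write(proto_text(phone) + "\n")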

DAMP

  • Align the DAMP dataset with the generated HMM models using forced Viterbi alignment, as sketched below
  • Perform embedded re-estimation on the DAMP dataset to refine the phoneme models.
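The repository's alignment scripts are not shown on this page; with HTK, a forced Viterbi alignment is typically run through HVite, roughly as below. Every file name here is a placeholder.

# Hedged sketch: forced Viterbi alignment with HTK's HVite, which the
# alignment scripts presumably wrap. All paths are placeholders.
import subprocess

subprocess.run([
    "HVite",
    "-a",                      # align mode: build the network from the MLF
    "-m",                      # include model (phoneme) boundaries in output
    "-H", "hmm/hmmdefs",       # trained HMM definitions
    "-I", "damp_words.mlf",    # word-level transcriptions to force-align
    "-i", "damp_aligned.mlf",  # output MLF with phoneme time stamps
    "-S", "damp_train.scp",    # list of DAMP feature files
    "dict", "phonelist",       # pronunciation dictionary and phone list
], check=True)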

Structural Segmentation

  • Use the MSAF library to segment the DAMP training data into structural segments (see the sketch after these steps)
python scripts/msaf_segmentation.py <wav_in_dir> <wav_out_dir>
  • Create MLF files corresponding to the segmented audio
python scripts/msaf_to_mlf.py <labfile_list>
  • Perform embedded re-estimation within these segments to get the final phoneme models
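The two segmentation scripts are not reproduced here. Using MSAF's documented process() API and HTK's MLF conventions (times in 100 ns units), the core of both steps might look roughly like this; the algorithm choices and file names are assumptions.

# Hedged sketch: segment one song with MSAF and write the boundaries as an
# HTK master label file (MLF). The boundary/label algorithms and file names
# are assumptions; the repo's scripts may choose differently.
import msaf

# process() returns segment boundaries (seconds) and one label per segment.
boundaries, labels = msaf.process("song.wav",
                                  boundaries_id="sf", labels_id="fmc2d")

HTK_UNITS_PER_SEC = 10_000_000  # HTK label times are in 100 ns units

with open("song_segments.mlf", "w") as out:
    out.write("#!MLF!#\n")
    out.write('"*/song.lab"\n')
    for start, end, label in zip(boundaries[:-1], boundaries[1:], labels):
        out.write(f"{int(start * HTK_UNITS_PER_SEC)} "
                  f"{int(end * HTK_UNITS_PER_SEC)} seg{int(label)}\n")
    out.write(".\n")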

Testing

  • To test any model, first run the forced Viterbi alignment
sh scripts/force_align.sh

Set parameters such as the model, features, MLF, and dictionary inside the script.

  • To evaluate the model's performance, compute overlap against the manually annotated ground truth
python scripts/lab_to_lrc.py <lyrics_list>

Set the ground-truth and output folders inside the script. A sketch of the overlap computation follows.
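The exact metric lives in the evaluation script; a minimal version of an overlap measure, assuming (word, start, end) intervals in matching order, could be:

# Minimal sketch of an overlap metric: the fraction of ground-truth duration
# covered by the predicted interval for the same word. The (word, start, end)
# tuple format is an assumption about the .lab/.lrc contents.

def overlap_ratio(predicted, groundtruth):
    """Both arguments: lists of (word, start_sec, end_sec) in the same order."""
    covered = total = 0.0
    for (_, p_start, p_end), (_, g_start, g_end) in zip(predicted, groundtruth):
        covered += max(0.0, min(p_end, g_end) - max(p_start, g_start))
        total += g_end - g_start
    return covered / total if total else 0.0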

Authors

Acknowledgments

  • Thanks to Alex Lerch for his guidance
  • S Aswin Shanmugham's hybrid segmentation framework
  • Stanford's DAMP dataset.
