forced-alignment-tools

A collection of links and notes on forced alignment tools

Version: 1.0.8
Date: 2018-03-03
Author: Alberto Pettarin (contact)
License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Did I miss an aligner? Please open an issue or directly fork-commit-pullrequest.

Definition of Forced Alignment

Given an audio file containing speech, and the corresponding transcript, computing a forced alignment is the process of determining, for each fragment of the transcript, the time interval (in the audio file) containing the spoken text of the fragment.

A text fragment can have arbitrary granularity:

a paragraph,
a sentence,
a portion of a sentence (i.e., a group of words),
a word, or
a phoneme (i.e., a single sound).

For example, given this text file and this audio file, a forced alignment at verse level can be the following:

1                                                     => [00:00:00.000, 00:00:02.640]
From fairest creatures we desire increase,            => [00:00:02.640, 00:00:05.880]
That thereby beauty's rose might never die,           => [00:00:05.880, 00:00:09.240]
But as the riper should by time decease,              => [00:00:09.240, 00:00:11.920]
His tender heir might bear his memory:                => [00:00:11.920, 00:00:15.280]
...
Pity the world, or else this glutton be,              => [00:00:43.640, 00:00:48.080]
To eat the world's due, by the grave and thee.        => [00:00:48.080, 00:00:53.240]
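The listing above can be turned into structured data with a few lines of code. The following is only a minimal sketch: the `text => [begin, end]` layout is taken from the example, while the names `Fragment`, `to_seconds` and `parse_alignment` are hypothetical and not part of any aligner's API. Real tools usually export richer formats (JSON, TextGrid, SRT, and so on).

```python
import re
from typing import NamedTuple


class Fragment(NamedTuple):
    """One aligned text fragment (hypothetical helper type)."""
    text: str
    begin: float  # seconds from the start of the audio file
    end: float    # seconds from the start of the audio file


# Matches lines shaped like: some text   => [HH:MM:SS.mmm, HH:MM:SS.mmm]
LINE_RE = re.compile(
    r"^(?P<text>.*?)\s*=>\s*\[(?P<begin>[\d:.]+),\s*(?P<end>[\d:.]+)\]\s*$"
)


def to_seconds(timestamp: str) -> float:
    """Convert an HH:MM:SS.mmm timestamp to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_alignment(lines):
    """Parse alignment lines, skipping anything that does not match (e.g. '...')."""
    fragments = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue
        fragments.append(
            Fragment(
                match["text"],
                to_seconds(match["begin"]),
                to_seconds(match["end"]),
            )
        )
    return fragments


sample = [
    "1                                                     => [00:00:00.000, 00:00:02.640]",
    "From fairest creatures we desire increase,            => [00:00:02.640, 00:00:05.880]",
    "...",
]
fragments = parse_alignment(sample)
for fragment in fragments:
    print(f"{fragment.begin:7.3f} -> {fragment.end:7.3f}  {fragment.text}")
```

Note that each interval starts where the previous one ends, so at this granularity the fragments tile the audio file without gaps.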

Typical applications of forced alignment include Audio-eBooks, closed captioning, and automating the creation of training data for automated speech recognition systems.
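As an illustration of the closed captioning use case, the sketch below converts a list of aligned fragments into SubRip (SRT) subtitle text. It assumes the alignment is already available as `(begin, end, text)` tuples with times in seconds; the function names `srt_timestamp` and `to_srt` are hypothetical, and several of the tools listed later can export subtitle formats on their own.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(fragments) -> str:
    """Build SRT text from (begin, end, text) tuples, numbering cues from 1."""
    blocks = []
    for index, (begin, end, text) in enumerate(fragments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(begin)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # SRT cues are separated by a blank line.
    return "\n".join(blocks)


# Two verses from the example alignment, with times in seconds.
fragments = [
    (2.640, 5.880, "From fairest creatures we desire increase,"),
    (5.880, 9.240, "That thereby beauty's rose might never die,"),
]
srt_text = to_srt(fragments)
print(srt_text)
```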

Programs and Libraries

The following matrix lists open source programs and libraries for computing forced alignments that have been verified to install and run (although the installation procedure for some of them is fairly complex).

All tools, except aeneas, are based on speech recognition algorithms; all tools, except aeneas and gentle, are maintained by research groups or individuals in academia.

Many of the tools (7 of the 15 in the matrix below) are based on HTK, the Hidden Markov Model Toolkit, which is not free for commercial use, although a commercial license can be purchased from the University of Cambridge.

You can also download the raw data file in JSON format.

Name	Algorithm	Supported Language(s)	Interface	Code Language(s)	License	Documentation	Mailing List/Forum	Active	Notes
aeneas	DTW	30+	CLI, LIB, Web	Python, C	AGPL	Y	Y	Y	Not based on ASR
CMU Sphinx	HMM (own), RNN	11	CLI, LIB	C, Java, Python	MIT-like	Y	Y	Y
DARLA	HMM (HTK)	English	Web	?	?	Y	N	N?	Based on Prosodylab-Aligner or YouTube ASR
FAVE-align	HMM (HTK)	English	CLI, (Web)	Python	GPL	Y	Y	Y	Acoustic models from P2FA; GitHub code updated more frequently than Web
Gentle	HMM (Kaldi)	English	CLI, Web	Python	MIT	N	N	Y	Based on Kaldi
Julius	HMM (own)	English, Japanese	CLI, LIB	C	MIT-like	Y	Y	N?
Kaldi	HMM (own), DNN, RNN	English	CLI, LIB	C++	Apache	Y	Y	Y	CUDA support
kaldi-dnn-ali-gop	HMM (Kaldi), DNN (Kaldi nnet3)	English	CLI, LIB	Shell Script, C++, Python	GPL	N	N	Y	Works with other languages given Kaldi acoustic models
LaBB-CAT	HMM (HTK)	English	Web	Java	GPL	Y	Y	Y
MAUS	HMM (HTK)	10	CLI, Web	C	All rights reserved	README	N	Y
Montreal Forced Aligner	HMM (Kaldi)	English	CLI	Python	MIT	Y	N	Y	Can train other languages
Penn Forced Aligner (P2FA)	HMM (HTK)	English	CLI, Web	Python	?	README, Tutorial	N	N?
Prosodylab-Aligner	HMM (HTK)	English	CLI	Python	MIT	README, Tutorial	N	Y	Can train other languages
SailAlign	HMM (HTK)	English, Greek, Spanish	CLI	Perl	GPL	README	N	N?
SPPAS	HMM (Julius)	12+	CLI, GUI	Python	GPL	Y	Y	Y	Can train other languages; several plugins

AGPL: GNU Affero General Public License
Apache: Apache License
CLI: command line interface
DNN: Deep Neural Network
DTW: Dynamic Time Warping
GPL: GNU General Public License
GUI: graphical interface
HMM: Hidden Markov Model
LIB: library callable by third party software
MFCC: Mel-frequency Cepstral Coefficients
MIT: MIT License
RNN: Recurrent Neural Network
Web: Web-based graphical interface, local and/or remote
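To give an intuition for the DTW entry in the legend above, here is a minimal sketch of classic Dynamic Time Warping on two short numeric sequences. It is a textbook formulation for illustration only, not the implementation used by aeneas or any other tool in the matrix; a DTW-based aligner would typically compare sequences of feature vectors (such as MFCCs) rather than plain numbers, and the function name `dtw_path` is hypothetical.

```python
def dtw_path(a, b):
    """Return (total_cost, path) aligning sequence a to sequence b.

    path is a list of (i, j) index pairs, meaning a[i] is matched with b[j].
    The local distance is the absolute difference (an assumption made for
    this toy example).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            distance = abs(a[i - 1] - b[j - 1])
            cost[i][j] = distance + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# The second sequence is a "slower" rendition of the first one.
reference = [0, 1, 2, 3]
stretched = [0, 0, 1, 2, 2, 3]
total_cost, path = dtw_path(reference, stretched)
print(total_cost)
print(path)
```

The warping path maps each position of one sequence to one or more positions of the other; in an alignment setting, this kind of mapping is what lets time stamps known for one sequence be transferred to the other.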

Additional Pointers

AZP2FA (fork of P2FA)
Automated Audio Segmentation Using Forced Alignment
Automatic and Accurate Captioning (based on CMU Sphinx)
Berkeley Phonetics Machine
Building Acoustic Models using Kaldi Voxforge recipe to obtain word level transcripts for long video files
DARLA
EasyAlign: phonetic alignment with Praat
FAVE-align (the Web interface for the Penn Forced Aligner)
FAVE-align (source code)
Forced Alignment Overview (ISIP)
Forced Alignment and Speech Recognition Systems (Oxford)
Forced Alignment of Spoken Audio
Forced Alignment with InproTK (and Sphinx)
Gentle (based on Kaldi)
HTKBook (has a chapter on computing forced alignments with HTK, requires registration)
InproTK
Introduction to Speech Analysis with FAVE
Julius
Kaldi Forced Alignment
Kaldi
Korean Phonetic Aligner (Web only, Korean only)
LaBB-CAT
Long Audio Aligner Landed in Trunk (Sphinx)
MAUS
Montreal Forced Aligner
Penn Forced Aligner
Penn Forced Aligner
Praatalign: an interactive Praat plug-in for performing phonetic forced alignment
ProsodyLab-Aligner
Robust Automatic Transcription of Speech (RATS)
SPPAS Automatic Annotation of Speech (based on Julius)
Simple English Forced Alignment (UPenn LING521)
VoxForge
WebMAUS (the Web interface for MAUS)
What is forced alignment? (ICSI)
What is forced alignment? (VoxForge)
aeneas
speech.zone
