A collection of links and notes on forced alignment tools
- Version: 1.0.8
- Date: 2018-03-03
- Author: Alberto Pettarin (contact)
- License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Did I miss an aligner? Please open an issue or submit a pull request.
Given an audio file containing speech, and the corresponding transcript, computing a forced alignment is the process of determining, for each fragment of the transcript, the time interval (in the audio file) containing the spoken text of the fragment.
A text fragment can have arbitrary granularity:
- a paragraph,
- a sentence,
- a portion of a sentence (i.e., a group of words),
- a word, or
- a phoneme (i.e., a single sound).
For example, given this text file and this audio file, a forced alignment at verse level could be the following:
1 => [00:00:00.000, 00:00:02.640]
From fairest creatures we desire increase, => [00:00:02.640, 00:00:05.880]
That thereby beauty's rose might never die, => [00:00:05.880, 00:00:09.240]
But as the riper should by time decease, => [00:00:09.240, 00:00:11.920]
His tender heir might bear his memory: => [00:00:11.920, 00:00:15.280]
...
Pity the world, or else this glutton be, => [00:00:43.640, 00:00:48.080]
To eat the world's due, by the grave and thee. => [00:00:48.080, 00:00:53.240]
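The alignment above can be loaded into a small data structure for further processing. The following is a minimal sketch in Python, assuming the `text => [begin, end]` line format shown above; the names `Fragment`, `parse_timestamp`, and `parse_line` are illustrative and do not belong to any of the aligners listed in this document.

```python
import re
from dataclasses import dataclass

# Matches lines such as: some text => [00:00:02.640, 00:00:05.880]
LINE_RE = re.compile(
    r"^(?P<text>.*?)\s*=>\s*\[(?P<begin>[\d:.]+),\s*(?P<end>[\d:.]+)\]$"
)


@dataclass
class Fragment:
    """One text fragment and its time interval in the audio file."""
    text: str
    begin: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio

    @property
    def duration(self) -> float:
        return self.end - self.begin


def parse_timestamp(value: str) -> float:
    """Convert an HH:MM:SS.mmm string into seconds."""
    hours, minutes, seconds = value.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_line(line: str) -> Fragment:
    """Parse one 'text => [begin, end]' line (format assumed from the example)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an alignment line: {line!r}")
    return Fragment(
        match["text"],
        parse_timestamp(match["begin"]),
        parse_timestamp(match["end"]),
    )


lines = [
    "1 => [00:00:00.000, 00:00:02.640]",
    "From fairest creatures we desire increase, => [00:00:02.640, 00:00:05.880]",
]
fragments = [parse_line(line) for line in lines]
for fragment in fragments:
    print(f"{fragment.begin:7.3f} {fragment.end:7.3f} {fragment.text}")
```

Storing times as seconds rather than strings makes it straightforward to compute durations or to check that consecutive fragments are contiguous, as they are in the example above.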
Typical applications of forced alignment include Audio-eBooks, closed captioning, and automating the creation of training data for automated speech recognition systems.
The following matrix lists open source programs and libraries for computing forced alignments that have been verified to install and run (although the installation procedure for some of them is quite complex).
All tools, except aeneas, are based on speech recognition algorithms; all tools, except aeneas and gentle, are maintained by research groups or individuals in academia.
Most tools are based on HTK, which is not free for commercial use, although a commercial license can be purchased from the University of Cambridge.
You can also download the raw data file in JSON format.
Name | Algorithm | Supported Language(s) | Interface | Code Language(s) | License | Documentation | Mailing List/Forum | Active | Notes |
---|---|---|---|---|---|---|---|---|---|
aeneas | DTW | 30+ | CLI, LIB, Web | Python, C | AGPL | Y | Y | Y | Not based on ASR |
CMU Sphinx | HMM (own), RNN | 11 | CLI, LIB | C, Java, Python | MIT-like | Y | Y | Y | |
DARLA | HMM (HTK) | English | Web | ? | ? | Y | N | N? | Based on Prosodylab-Aligner or YouTube ASR |
FAVE-align | HMM (HTK) | English | CLI, (Web) | Python | GPL | Y | Y | Y | Acoustic models from P2FA; GitHub code updated more frequently than Web |
Gentle | HMM (Kaldi) | English | CLI, Web | Python | MIT | N | N | Y | Based on Kaldi |
Julius | HMM (own) | English, Japanese | CLI, LIB | C | MIT-like | Y | Y | N? | |
Kaldi | HMM (own), DNN, RNN | English | CLI, LIB | C++ | Apache | Y | Y | Y | CUDA support |
kaldi-dnn-ali-gop | HMM (Kaldi), DNN (Kaldi nnet3) | English | CLI, LIB | Shell Script, C++, Python | GPL | N | N | Y | Works with other languages given Kaldi acoustic models |
LaBB-CAT | HMM (HTK) | English | Web | Java | GPL | Y | Y | Y | |
MAUS | HMM (HTK) | 10 | CLI, Web | C | All rights reserved | README | N | Y | |
Montreal Forced Aligner | HMM (Kaldi) | English | CLI | Python | MIT | Y | N | Y | Can train other languages |
Penn Forced Aligner (P2FA) | HMM (HTK) | English | CLI, Web | Python | ? | README, Tutorial | N | N? | |
Prosodylab-Aligner | HMM (HTK) | English | CLI | Python | MIT | README, Tutorial | N | Y | Can train other languages |
SailAlign | HMM (HTK) | English, Greek, Spanish | CLI | Perl | GPL | README | N | N? | |
SPPAS | HMM (Julius) | 12+ | CLI, GUI | Python | GPL | Y | Y | Y | Can train other languages; several plugins |
- AGPL: GNU Affero General Public License
- Apache: Apache License
- CLI: command line interface
- DNN: Deep Neural Network
- DTW: Dynamic Time Warping
- GPL: GNU General Public License
- GUI: graphical interface
- HMM: Hidden Markov Model
- LIB: library callable by third party software
- MFCC: Mel-frequency Cepstral Coefficients
- MIT: MIT License
- RNN: Recurrent Neural Network
- Web: Web-based graphical interface, local and/or remote
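To make the DTW entry above more concrete, here is a minimal sketch of classic dynamic time warping over two sequences of numbers. It is a textbook formulation written for illustration, not the implementation used by aeneas or any other tool in the matrix. Real aligners work on audio feature vectors (such as MFCCs) rather than plain numbers, and the function name `dtw` is an assumption of this example.

```python
def dtw(a, b):
    """Return (total cost, warping path) for aligning sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between two samples
            cost[i][j] = d + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    # Backtrack from the end to recover which indices were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# The second sequence repeats the value 1, as if it were spoken more slowly.
total, path = dtw([0, 1, 2, 3], [0, 1, 1, 2, 3])
print(total, path)
```

In this toy input the two repeated samples of the second sequence are both matched to the single `1` of the first, which is the "warping" that lets sequences of different lengths be aligned.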
- AZP2FA (fork of P2FA)
- Automated Audio Segmentation Using Forced Alignment
- Automatic and Accurate Captioning (based on CMU Sphinx)
- Berkeley Phonetics Machine
- Building Acoustic Models using Kaldi Voxforge recipe to obtain word level transcripts for long video files
- DARLA
- EasyAlign: phonetic alignment with Praat
- FAVE-align (the Web interface for the Penn Forced Aligner)
- FAVE-align (source code)
- Forced Alignment Overview (ISIP)
- Forced Alignment and Speech Recognition Systems (Oxford)
- Forced Alignment of Spoken Audio
- Forced Alignment with InproTK (and Sphinx)
- Gentle (based on Kaldi)
- HTKBook (has a chapter on computing forced alignments with HTK, requires registration)
- InproTK
- Introduction to Speech Analysis with FAVE
- Julius
- Kaldi Forced Alignment
- Kaldi
- Korean Phonetic Aligner (Web only, Korean only)
- LaBB-CAT
- Long Audio Aligner Landed in Trunk (Sphinx)
- MAUS
- Montreal Forced Aligner
- Penn Forced Aligner
- Praatalign: an interactive Praat plug-in for performing phonetic forced alignment
- ProsodyLab-Aligner
- Robust Automatic Transcription of Speech (RATS)
- SPPAS Automatic Annotation of Speech (based on Julius)
- Simple English Forced Alignment (UPenn LING521)
- VoxForge
- WebMAUS (the Web interface for MAUS)
- What is forced alignment? (ICSI)
- What is forced alignment? (VoxForge)
- aeneas
- speech.zone