🗼CCAligner

Word by word audio subtitle synchronization tool and API. Being developed under Google Summer of Code 2017 with CCExtractor.

(https://saurabhshri.github.io/)

There are two major branches in this repository - master and development.

The master branch is updated at the end of each phase (currently it has code which was written in the second phase) and the development branch is regularly updated as the work proceeds. At the end of each phase, the development branch is merged with the master branch.

The project is in its very early stage and is constantly evolving. The available functions, usage instructions et cetera are expected to change over time. It is not production ready but you are welcome to play with it, or better, help improve it! :)

Using CCAligner

CCAligner can be used either as a standalone tool or as a library in your own project.

Installing Dependencies

To automatically generate language models, dictionaries and grammars, the following dependencies need to be installed. The tool can generate them without these dependencies, but the accuracy in that case is not guaranteed. It is highly recommended to work with the dependencies installed.

cmuclmtk : to generate vocab and LM. (https://sourceforge.net/projects/cmusphinx/files/cmuclmtk/0.7/cmuclmtk-0.7.tar.gz/download)
g2p-seq2seq : to generate dictionary. (https://github.com/cmusphinx/g2p-seq2seq)

Steps :

Linux/MacOS

To install cmuclmtk :

Download cmuclmtk from the link mentioned above and uncompress cmuclmtk-0.7.tar.gz while preserving the permissions :
```
tar xvpzf cmuclmtk-0.7.tar.gz
```
Navigate to cmuclmtk-0.7 directory :
```
cd cmuclmtk-0.7
```
Install :
```
./configure
make
sudo make install
```

You may have to run sudo ldconfig to fix errors such as missing shared library.

To install g2p-seq2seq :

First, install Tensorflow by your preferred choice of method. If you are on Linux (x86_64), you may directly run the following :
```
sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.0.0-cp27-none-linux_x86_64.whl
```
Download g2p-seq2seq from the link mentioned above and uncompress g2p-seq2seq-master.zip while preserving the permissions :
```
unzip g2p-seq2seq-master.zip
```
Navigate to g2p-seq2seq-master directory and run :
```
sudo python setup.py install
```

I have incorporated the above steps into a bash script located inside the install/ directory, which you may simply call using :

`./install_grammar_tools.sh`

In case you have some trouble with dependencies, I have mirrored the last known working versions of them in the repository at : (https://github.com/saurabhshri/mirror/) .

The alternate ways of generating language models, dictionaries and grammars are covered later in the docs.

Windows

To install cmuclmtk :

Download cmuclmtk from the link mentioned above and uncompress cmuclmtk-0.7.tar.gz. Build it with the cmuclmtk.sln it provides.
Copy the compiled files (wfreq2vocab.exe, text2wfreq.exe, text2idngram.exe, idngram2lm.exe) to the same directory as CCAligner, or add the directory that contains these files to the PATH environment variable.

To install g2p-seq2seq :

First, install Python 3.5 (64-bit) and Tensorflow 1.0.0 by your preferred choice of method
Download g2p-seq2seq from the link mentioned above and uncompress g2p-seq2seq-master.zip
Navigate to g2p-seq2seq-master directory and run :
```
python setup.py install
```

You will also need to install Perl and move install/quick_lm.pl to the same directory as CCAligner or to a directory that is listed in the PATH environment variable.

Before You Run

Please make sure you have all the dependencies installed in case you want to use grammar tools. To disable generating grammar by CCAligner, issue --generate-grammar no.
Make sure the model folder and g2p-seq2seq-cmudict are in the directory where you are compiling CCAligner.
Make sure the subtitles are clean and are in proper SRT format.
The wav file should be 16 bit PCM mono sampled at 16KHz. To generate the wav file from a video using ffmpeg, you may run :
```
./ffmpeg -i input.video -bits_per_raw_sample 16 -ar 16000 -ac 1 output.wav
```
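
To double-check a wav file before aligning, the sketch below reads the format fields straight from the file header with `od`. It is only an illustration: `wav_field` and `check_wav` are hypothetical helper names, and the fixed offsets assume a canonical header (the `fmt ` chunk immediately after `RIFF`/`WAVE`) read on a little-endian machine.

```shell
# Print the unsigned integer stored at byte offset $2 (width $3 bytes) of file $1.
# Assumes a little-endian host, which matches the WAV byte order.
wav_field() {
  od -A n -t "u$3" -j "$2" -N "$3" "$1" | tr -d ' \n'
}

# Succeed only if $1 looks like 16 bit PCM, mono, 16000 Hz.
# Offsets assume the canonical layout with the fmt chunk at byte 12.
check_wav() {
  [ "$(wav_field "$1" 20 2)" = "1" ] &&      # audio format 1 = PCM
  [ "$(wav_field "$1" 22 2)" = "1" ] &&      # number of channels
  [ "$(wav_field "$1" 24 4)" = "16000" ] &&  # sample rate in Hz
  [ "$(wav_field "$1" 34 2)" = "16" ]        # bits per sample
}
```

For example, `check_wav output.wav || echo "please convert the file first"` would flag a file that does not match the expected format. Files with extra chunks before `fmt ` would need a proper chunk walk instead of fixed offsets.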

Installing

Linux/MacOS

Clone the repository from Github using :

git clone https://www.github.com/saurabhshri/CCAligner.git

Navigate to install directory and run build.sh.
```
cd install/
./build.sh
```
Align!
```
./ccaligner <arguments>
```

Windows

Clone the repository from Github using :

git clone https://www.github.com/saurabhshri/CCAligner.git

Use CMake to generate project files, and then build it.
Align!
```
.\ccaligner <arguments>
```

Quick Demo

The default output of CCAligner is stored as an XML file. For example, the next command will generate file.xml :

./ccaligner -wav /path/to/file.wav -srt /path/to/file.srt

Generated Output Snippet :

.
.
<subtitle>
    <start>12780</start>
    <dialogue>I was offered a summer research      fellowship at Princeton.    </dialogue>
    <edited_dialogue>I was offered a summer research fellowship at Princeton</edited_dialogue>
        <words>
            <word>
                <recognised>0</recognised>
                <text>I</text>
                <start>12780</start>
                <end>12911</end>
                <duration>131</duration>
            </word>
            <word>
                <recognised>1</recognised>
                <text>was</text>
                <start>13030</start>
                <end>13330</end>
                <duration>300</duration>
            </word>
            <word>
                <recognised>1</recognised>
                <text>offered</text>
                <start>13400</start>
                <end>13770</end>
                <duration>370</duration>
            </word>
            .
            .
            .
        </words>
    <end>16382</end>
</subtitle>
.
.
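
If you want to consume this output from a script, the per-word timings can be pulled out with a few lines of awk. The sketch below is only an illustration: `words_from_xml` is a hypothetical helper, and it assumes the layout shown above (one element per line). A real XML parser is the more robust choice.

```shell
# Print "text start end" for every <word> element read from stdin or from the given files.
# Assumes each tag sits on its own line, as in the snippet above.
words_from_xml() {
  awk '
    # Strip the opening tag (with leading blanks) and the closing tag from a line.
    function inner(line) {
      gsub(/^[ \t]*<[^>]+>|<\/[^>]+>[ \t]*$/, "", line)
      return line
    }
    /<word>/              { in_word = 1; next }
    /<\/word>/            { print text, w_start, w_end; in_word = 0; next }
    in_word && /<text>/   { text    = inner($0) }
    in_word && /<start>/  { w_start = inner($0) }
    in_word && /<end>/    { w_end   = inner($0) }
  ' "$@"
}
```

For the snippet above, `words_from_xml file.xml` would print lines such as `I 12780 12911` and `was 13030 13330`. The subtitle-level `<start>` and `<end>` are skipped because they sit outside any `<word>` element.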

API or Library usage

Clone the repository from Github :

git clone https://github.com/saurabhshri/CCAligner.git

Place the CCAligner folder in appropriate directory in your project.
In your project, simply include the directories and source file you wish to use. You may refer to CMakeLists.txt in the src/ directory to get an idea. The CCAligner tool is built around the CCAligner API.

For example : If you want to use the audio based alignment in your project

//include the header file
#include "recognize_using_pocketsphinx.h"

//Declare the aligner
PocketsphinxAligner * aligner = new PocketsphinxAligner(_parameters);

//Align
aligner->align();

//Print the result
aligner->printAligned("Manual_Printing.json", json);

//delete the aligner
delete(aligner);

Complete documentation of the API will be written under docs.

Some Previews

Click on video thumbnail or link to watch the video on YouTube.

	Word by Word Audio Subtitle Synchronization - Karaoke Demo 1 (https://www.youtube.com/watch?v=38_27E1PxXA) [Sitcom]
	Word by Word Audio Subtitle Synchronization - Karaoke Demo 2 (https://www.youtube.com/watch?v=6VnhC8u_d40) [Ted Talk]
	Word by Word Audio Subtitle Synchronization - Karaoke Demo 3 (https://www.youtube.com/watch?v=j_zeixo-zJY) [Cartoon Show]
	Word by Word Audio Subtitle Synchronization - Karaoke Demo 1 (https://www.youtube.com/watch?v=8tTDX6NZGsU) [Discussion Video]
	Word by Word Audio Video Transcription Demo (https://www.youtube.com/watch?v=tFrf0TVnqIQ) [Reality Show]
	Approximate Word by Word Audio Subtitle Synchronization (https://www.youtube.com/watch?v=km1iHe_mGuo)

Usage Parameters

The following is a complete list of available parameters that can be passed to CCAligner. Feel free to open a PR if you spot a missing parameter.

Input related parameters :

Parameter	Accepted Values	Description
`-wav`	`/path/to/wav_file`	Provide path to input audio wave file. Wave file must be 16 bit PCM mono sampled at 16KHz. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt` Required : yes.
`-srt`	`/path/to/subtitle_file`	Provide path to subtitle file in SRT format. Please ensure that the subtitle file is clean and in proper format. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt` Required : yes.
`-stdin` or `-`	Audio wave file from stdin or pipe.	Use this parameter to pass wav file from `stdin` or pipe. E.g.: `cat tbbt.wav \| ccaligner -stdin -srt tbbt.srt`

Output related parameters :

Parameter	Accepted Values	Description
`-out`	`/path/to/output_file`	Provide the name and path of the output file to generate. By default, the output name is derived from the input file name and the file is written to the same location as the input file. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt -out my_output.xml`
`-oFormat`	`xml`, `json`, `srt`, `karaoke`, `stdout`	To choose output format. By default the output format is XML. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt -out output_as_karaoke.srt -oFormat karaoke`
`-log`	`/path/to/aligner_log_file/`	Specify path to logfile for PocketSphinx decoder. By default stores log in `tempFiles/{execution_timestamp}.log` E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt -log tbbt.log`
`-phoneLog`	`/path/to/phoneme_log_file/`	Specify path to logfile for PocketSphinx phoneme decoder. By default stores log in `tempFiles/phoneme_{execution_timestamp}.log` E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt -phoneLog tbbt_phoneme.log`

Alignment related parameters :

Parameter	Accepted Values	Description
`-approx`	`yes`, `no`	Use the approximate aligner instead of the audio based aligner. The timing of each word is calculated from its weight. It is very fast and doesn’t involve audio analysis. Please be aware that the result is approximate, not accurate. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt -approx yes`
`--enable-phonemes`	`yes`, `no`	Recognise and find phonemes and their timestamps along with words. SRT and Karaoke output can not display phonemes. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt --enable-phonemes yes`
`-transcribe`	`yes`, `no`	Performs transcription of complete audio instead of searching using timestamps and subs. Use this when timings in subtitles are incorrect or you want YouTube like transcription of video. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt -transcribe yes`
`--use-fsg`	`yes`, `no`	Instruct CCAligner to follow Finite State Grammar while performing recognition. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt --use-fsg yes`
`-useBatchMode`	`yes`, `no`	Instruct CCAligner to use batch mode of PocketSphinx. May improve accuracy by flushing CMN values. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt -useBatchMode yes`
`-experiment`	`yes`, `no`	Use experimental parameters. May improve accuracy in some cases. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt -experiment yes`
`-searchWindow`	An integer	Determine the extent to which current recognised word is searched in the respective subtitle dialogue. Default value is 3. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt -searchWindow 6`
`-audioWindow`	An integer	Determine the frontal and rear window from current subtitle timing to perform recognition. The value should be in milliseconds. Default value is 0. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt -audioWindow 500`
`-sampleWindow`	An integer	Determine the frontal and rear window from current subtitle timing to perform recognition. The value should be in number of samples. Default value is 0. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt -sampleWindow 500`
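
To give a feel for what an approximate, audio-free alignment can look like, the sketch below splits a dialogue's time span across its words in proportion to their length in characters. This weighting is an assumption made for illustration, not a description of how `-approx` is implemented: the table only says that timings are based on each word's weight. `approx_word_times` is a hypothetical helper.

```shell
# Usage: approx_word_times START_MS END_MS WORD...
# Prints "word start end" per line, giving each word a share of the
# interval proportional to its character count (assumed weighting).
approx_word_times() {
  span_start=$1
  span_end=$2
  shift 2
  printf '%s\n' "$@" | awk -v s="$span_start" -v e="$span_end" '
    { word[NR] = $0; total += length($0) }
    END {
      if (total == 0) exit
      t = s
      for (i = 1; i <= NR; i++) {
        # int() truncates, so the last word may end slightly before e.
        d = int((e - s) * length(word[i]) / total)
        printf "%s %d %d\n", word[i], t, t + d
        t += d
      }
    }'
}
```

It could be called as `approx_word_times 6948484 6950860 The Force is strong with this one`, using the start and end of a subtitle in milliseconds.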

Grammar, Language Model related parameters :

Parameter	Accepted Values	Description
`--generate-grammar`	`yes`, `no`, `onlyCorpus`, `onlyDict`, `onlyFSG`, `onlyLM`, `onlyVocab`	Parameter deciding if and which type of grammar/lm to be generated. Once you have generated these files, no need to generate them again. They are stored in `tempFiles/{respective_dir}`. Also, use this when supplying files manually. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt --generate-grammar no`
`-model`	`path/to/acoustic/model`	Enter path of acoustic model to be used by aligner. Accuracy highly depends on the acoustic model. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt -model path/to/acoustic/model`
`-lm`	`path/to/language/model`	Enter path of language model to be used by aligner. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt -lm custom.lm`
`-dict`	`path/to/dictionary`	Enter path of dictionary to be used by aligner. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt -dict custom.dict`
`-fsg`	`path/to/fsg/directory`	Enter path of the directory containing FSGs, each FSG with name as starting timestamp of dialogue. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt -fsg fsg/`
`-phoneLM`	`path/to/phonetic/language/model`	Enter path of phonetic language model to be used by aligner. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt -phoneLM path/to/phonetic/language/model`
`--quick-dict`	`yes`,`no`	Generate dictionary quickly without using TensorFlow and seq2seq. Result might not give best accuracy. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt --quick-dict yes`
`--quick-lm`	`yes`,`no`	Generate language model quickly without using cmuclmtk. Result might not give best accuracy. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt --quick-lm yes`

Display related parameters :

Parameter	Accepted Values	Description
`-verbose`	`yes`, `no`	Turns verbosity on and off. Turn off for preventing [info] logs. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt -verbose no`
`--display-recognised`	`yes`, `no`	Determine whether to display the recognised words and matching status on stdout or not. E.g.: `ccaligner -wav tbbt.wav -srt tbbt.srt --display-recognised no`

Project Details

Typical subtitle files (such as SubRip) are synchronized line by line, i.e. a subtitle containing the dialogue appears when the person starts talking and disappears when the dialogue finishes. This continues for the whole video. For example :

1274
01:55:48,484 --> 01:55:50,860
The Force is strong with this one

In the above example, dialogue #1274 (The Force is strong with this one) appears at 1:55:48, remains on the screen for about two seconds and disappears at 1:55:50.

The aim of the project is to tag the word as it is spoken, similar to that in karaoke systems.

E.g.

The           [6948484:6948500]
Force         [6948501:6948633]
is            [6948634:6948710]
strong        [6948711:6949099]
with          [6949100:6949313]

In the above example each word from subtitle is tagged with beginning and ending timestamps based on audio.
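
The bracketed values are consistent with milliseconds counted from the start of the video: 01:55:48,484 is (1 × 3600 + 55 × 60 + 48) × 1000 + 484 = 6948484. A minimal sketch of that conversion, where `srt_to_ms` is a hypothetical helper name and not part of CCAligner:

```shell
# Convert an SRT timestamp "HH:MM:SS,mmm" to milliseconds.
srt_to_ms() {
  printf '%s\n' "$1" | awk -F '[:,]' '{ print ($1 * 3600 + $2 * 60 + $3) * 1000 + $4 }'
}
```

`srt_to_ms 01:55:48,484` prints `6948484`, the start value of the first word in the example.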

Important Links

Project link on official GSoC web-app : https://summerofcode.withgoogle.com/projects/#5589068587991040
Project repository on Github: https://github.com/saurabhshri/CCAligner
Weekly blog : https://saurabhshri.github.io
Milestones and deliverable checklist : https://saurabhshri.github.io/gsoc/
Mentors : @cfsmp3 and @AlexBratosin2001

Credits and Licensing

I haven’t decided the license for the tool yet, but all the individual licenses of libraries and code used can be found under license/ directory.

I have tried my best to ensure that credit and reference is given in the source wherever it is due. In case I have missed any reference/license, firstly please accept my apology. Feel free to reach out to me and I’ll be happy to correct my mistake. 🤝

Contributing

The project is under constant development, and needs a lot of brushing and bug fixes. Feel free to contribute in any way. Your contribution will be highly appreciated! 🙂
