voice2json profiles

Speech models and supporting files for voice2json.

Data

Files are contained in <LANGUAGE>/<LOCALE> directories. Each locale directory should contain a SOURCE file describing where it was sourced from. The LICENSE file in each locale directory covers the artifacts for that specific profile.
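
The layout rule above can be checked mechanically. The following is a minimal sketch, not part of voice2json: the function name find_incomplete_profiles is hypothetical, and it assumes every second-level directory under the repository root is a locale directory that should hold both a SOURCE and a LICENSE file.

```python
from pathlib import Path

# File names taken from the paragraph above.
REQUIRED_FILES = ("SOURCE", "LICENSE")


def find_incomplete_profiles(root):
    """Return {relative profile dir: [missing file names]}.

    Assumption: every <LANGUAGE>/<LOCALE> directory sits exactly two
    levels below `root`. Hidden top-level directories (e.g. .git) are skipped.
    """
    root = Path(root)
    missing = {}
    for locale_dir in sorted(root.glob("*/*")):
        if not locale_dir.is_dir() or locale_dir.parent.name.startswith("."):
            continue
        absent = [name for name in REQUIRED_FILES
                  if not (locale_dir / name).is_file()]
        if absent:
            missing[str(locale_dir.relative_to(root))] = absent
    return missing
```

Calling find_incomplete_profiles(".") from the repository root returns an empty dictionary when every locale directory has both files.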

Directories with pocketsphinx contain CMU Sphinx acoustic models.
Directories with kaldi contain Kaldi acoustic models (either gmm or nnet3).
Directories with deepspeech contain Mozilla DeepSpeech acoustic models (version 0.6).
Directories with julius contain Julius acoustic models (DNN, version 4.5).

Some files are split into multiple parts so that they can be uploaded to GitHub. This is done with the split command:

split -d -b 25M FILE FILE.part-

They can be recombined simply with:

cat FILE.part-* > FILE
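
On systems without cat, the same recombination can be sketched in Python. This is only an illustration: join_parts is a hypothetical helper, and it assumes the parts were produced by split -d, whose numeric suffixes sort correctly in plain lexicographic order.

```python
import glob
import shutil
from pathlib import Path


def join_parts(target):
    """Concatenate TARGET.part-* files into TARGET and return the part count.

    Assumption: the parts live next to TARGET and were written by
    `split -d`, so sorting the file names gives the original order.
    """
    target = Path(target)
    pattern = glob.escape(target.name) + ".part-*"
    parts = sorted(target.parent.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no parts found for {target}")
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream each part so large models are never held in memory.
                shutil.copyfileobj(src, out)
    return len(parts)
```

For example, join_parts("FILE") reads FILE.part-00, FILE.part-01, and so on, and writes FILE in the same directory.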

Supported Languages

voice2json supports the following languages/locales. I don't speak or write any language besides U.S. English very well, so please let me know if any profile is broken or could be improved!

Untested profiles (marked UNTESTED below) may work, but I don't have the necessary data or enough understanding of the language to test them.

		Language	Locale	System	Closed	Open
View	Download	Catalan	ca-es	pocketsphinx	UNTESTED	UNTESTED
View	Download	Czech	cs-cz	kaldi	UNTESTED	UNTESTED
View	Download	Dutch (Nederlands)	nl	kaldi	★ ★ ★ ★ ★ (2x)	☹ (1x)
View	Download	Dutch (Nederlands)	nl	pocketsphinx	★ ★ ★ ★ (18x)	☹ (3x)
View	Download	English	en-in	pocketsphinx	☹ (4x)	☹ (4x)
View	Download	English	en-us	deepspeech	★ ★ ★ ★ ★ (1x)	★ ★ ★ ★ (1x)
View	Download	English	en-us	julius	★ ★ ★ ★ (1x)	UNTESTED
View	Download	English	en-us	kaldi	★ ★ ★ ★ ★ (3x)	★ ★ ★ ★ (1x)
View	Download	English	en-us	pocketsphinx	★ ★ ★ ★ ★ (9x)	★ ★ ★ ★ (2x)
View	Download	French (Français)	fr	kaldi	★ ★ ★ ★ (4x)	★ ★ ★ ★ (1x)
View	Download	French (Français)	fr	kaldi	★ ★ ★ ★ ★ (3x)	★ ★ ★ ★ ★ (0.5x)
View	Download	French (Français)	fr	pocketsphinx	★ ★ ★ ★ (23x)	☹ (3x)
View	Download	German (Deutsch)	de	pocketsphinx	★ ★ ★ ★ ★ (17x)	★ ★ ★ ★ ★ (3x)
View	Download	German (Deutsch)	de-DE	deepspeech	★ ★ ★ ★ ★ (1x)	★ ★ ★ ★ (1x)
View	Download	German (Deutsch)	de-DE	kaldi	★ ★ ★ ★ ★ (4x)	★ ★ ★ ★ (1x)
View	Download	Greek (Ελληνικά)	el-gr	pocketsphinx	★ ★ ★ ★ ★ (15x)	☹ (1x)
View	Download	Hindi (Devanagari)	hi	pocketsphinx	UNTESTED	UNTESTED
View	Download	Italian (Italiano)	it	pocketsphinx	★ ★ ★ ★ ★ (21x)	★ ★ ★ ★ ★ (7x)
View	Download	Italian (Italiano)	it	kaldi	★ ★ ★ ★ ★ (1x)	★ ★ ★ ★ ★ (1x)
View	Download	Kazakh (қазақша)	kz	pocketsphinx	UNTESTED	UNTESTED
View	Download	Korean	ko-kr	kaldi	☹ (4x)	☹ (4x)
View	Download	Mandarin	zh-cn	pocketsphinx	UNTESTED	UNTESTED
View	Download	Polish (polski)	pl	julius	UNTESTED	UNTESTED
View	Download	Portuguese (Português)	pt-br	pocketsphinx	★ ★ ★ ★ (51x)	☹ (11x)
View	Download	Russian (Русский)	ru	kaldi	★ ★ ★ ★ ★ (2x)	★ ★ ★ ★ ★ (0.5x)
View	Download	Russian (Русский)	ru	pocketsphinx	★ ★ ★ ★ ★ (17x)	☹ (1x)
View	Download	Spanish (Español)	es	kaldi	★ ★ ★ ★ ★ (4x)	★ ★ ★ ★ ★ (1x)
View	Download	Spanish (Español)	es	pocketsphinx	★ ★ ★ ★ (25x)	★ ★ ★ ★ (15x)
View	Download	Spanish	es-mexican	pocketsphinx	★ ★ ★ ★ ★ (9x)	★ ★ ★ ★ (2x)
View	Download	Swedish (svenska)	sv	kaldi	★ ★ ★ ★ (3x)	☹ (1x)
View	Download	Vietnamese (Tiếng Việt)	vi	kaldi	★ ★ ★ ★ ★ (4x)	☹ (1x)

Legend

Each profile is given a ★ rating, indicating how accurate it was at transcribing a set of test WAV files. I'm considering anything below 75% accuracy to be effectively unusable (☹).

Rating	Transcription Accuracy
★ ★ ★ ★ ★	[95%, 100%]
★ ★ ★ ★	[90%, 95%)
★ ★ ★	[85%, 90%)
★ ★	[80%, 85%)
★	[75%, 80%)
☹	[0%, 75%)
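
The legend above amounts to a threshold lookup. As a rough sketch (star_rating is a hypothetical name, and accuracy is assumed to be a fraction between 0 and 1 rather than a percentage):

```python
def star_rating(accuracy):
    """Map an accuracy fraction in [0, 1] to the rating used in the legend."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    # (inclusive lower bound, number of stars), highest band first.
    bands = [(0.95, 5), (0.90, 4), (0.85, 3), (0.80, 2), (0.75, 1)]
    for lower, stars in bands:
        if accuracy >= lower:
            return " ".join("★" * stars)
    # Anything below 75% is treated as effectively unusable.
    return "☹"
```

Each band includes its lower bound and excludes its upper bound, except the top band, which includes 100%.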

Profiles are tested in two conditions:

Closed
- All example sentences from the profile's sentences.ini are run through Google WaveNet to produce synthetic speech
- The profile is trained and tested on exactly the sentences it should recognize (ideal case)
- This resembles the intended use case of voice2json, though real-world speech will be less perfect
Open
- Speech examples are provided by contributors, VoxForge, or Mozilla Common Voice
- The profile is tested using the sample WAV files with the --open flag
- This (usually) demonstrates why it's best to define voice commands first!

Transcription speed-up is given as (Nx), where N is the average ratio of real-time duration to transcription time. A value of 2x means that voice2json transcribed the test WAV files twice as fast as their real-time durations on average. The reported values come from an Intel Core i7-based laptop with 16 GB of RAM, so expect slower transcriptions on a Raspberry Pi.
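
To make the speed-up figure concrete, here is a small sketch of one way to compute it. Both function names are hypothetical, and it assumes N is the mean of the per-file ratios; the actual test harness may aggregate differently (for example, total audio time over total transcription time).

```python
import wave
from statistics import mean


def wav_duration(path):
    """Return the real-time duration of a WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def average_speedup(timings):
    """Average of audio_seconds / transcribe_seconds over all test files.

    `timings` is an iterable of (audio_seconds, transcribe_seconds) pairs.
    """
    ratios = []
    for audio_seconds, transcribe_seconds in timings:
        if transcribe_seconds <= 0:
            raise ValueError("transcription time must be positive")
        ratios.append(audio_seconds / transcribe_seconds)
    return mean(ratios)
```

A 4-second file transcribed in 2 seconds and a 3-second file transcribed in 1 second give ratios of 2 and 3, so the reported value would be 2.5x.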

Acknowledgements

The acoustic models and pronunciation dictionaries come from one of:

When language models or grapheme-to-phoneme models were unavailable, they were generated using:

Data from Universal Dependencies
The Phonetisaurus G2P tool

MIT License