A list of open speech corpora for Speech Technology research and development.
This list favors corpora that are free (i.e. no $ cost) and truly open (i.e. released under some kind of Creative Commons license). Not every corpus listed meets both criteria, but all of them are accessible and usable for research and/or commercial use.
Feel free to propose additions to the list!
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
CommonVoice English | English | 780 hours (validated); 1,087 hours (total) | 39,577 speakers (reported: 11% female / 47% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice German | German | 325 hours (validated); 340 hours (total) | 5,007 speakers (reported: 10% female / 68% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice French | French | 173 hours (validated); 184 hours (total) | 3,005 speakers (reported: 9% female / 70% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Welsh | Welsh | 42 hours (validated); 48 hours (total) | 748 speakers (reported: 18% female / 33% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Breton | Breton | 3 hours (validated); 10 hours (total) | 118 speakers (reported: 3% female / 47% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Chuvash | Chuvash | 1 hour (validated); 2 hours (total) | 38 speakers (reported: 0% female / 47% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Turkish | Turkish | 9 hours (validated); 10 hours (total) | 344 speakers (reported: 11% female / 70% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Tatar | Tatar | 22 hours (validated); 26 hours (total) | 132 speakers (reported: 2% female / 83% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Kyrgyz | Kyrgyz | 8 hours (validated); 20 hours (total) | 97 speakers (reported: 47% female / 44% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Irish | Irish | 2 hours (validated); 3 hours (total) | 63 speakers (reported: 15% female / 62% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Kabyle | Kabyle | 181 hours (validated); 192 hours (total) | 584 speakers (reported: 15% female / 57% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Catalan | Catalan | 107 hours (validated); 120 hours (total) | 1,834 speakers (reported: 37% female / 43% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Chinese (Taiwan) | Mandarin (Taiwan) | 33 hours (validated); 43 hours (total) | 949 speakers (reported: 29% female / 46% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Slovenian | Slovenian | 2 hours (validated); 5 hours (total) | 42 speakers (reported: 20% female / 74% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Italian | Italian | 36 hours (validated); 40 hours (total) | 602 speakers (reported: 18% female / 62% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Dutch | Dutch | 18 hours (validated); 23 hours (total) | 502 speakers (reported: 2% female / 72% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Hakha Chin | Hakha Chin | 2 hours (validated); 4 hours (total) | 280 speakers (reported: 20% female / 24% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Esperanto | Esperanto | 13 hours (validated); 16 hours (total) | 129 speakers (reported: 11% female / 51% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Estonian | Estonian | 11 hours (validated); 12 hours (total) | 225 speakers (reported: 37% female / 57% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Persian | Persian | 67 hours (validated); 70 hours (total) | 1,240 speakers (reported: 13% female / 47% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Basque | Basque | 46 hours (validated); 83 hours (total) | 508 speakers (reported: 22% female / 53% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Spanish | Spanish | 27 hours (validated); 31 hours (total) | 611 speakers (reported: 9% female / 74% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Chinese (China) | Mandarin (China) | 11 hours (validated); 12 hours (total) | 288 speakers (reported: 0% female / 76% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Mongolian | Mongolian | 8 hours (validated); 9 hours (total) | 230 speakers (reported: 22% female / 35% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Sakha | Sakha | 3 hours (validated); 6 hours (total) | 35 speakers (reported: 10% female / 54% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Dhivehi | Dhivehi | 5 hours (validated); 8 hours (total) | 92 speakers (reported: 65% female / 27% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Kinyarwanda | Kinyarwanda | <1 hour (validated); 1 hour (total) | 32 speakers (reported: 0% female / 13% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Swedish | Swedish | 3 hours (validated); 3 hours (total) | 44 speakers (reported: 3% female / 75% male) | https://voice.mozilla.org/en/datasets | CC-0 |
CommonVoice Russian | Russian | 27 hours (validated); 31 hours (total) | 64 speakers (reported: 42% female / 55% male) | https://voice.mozilla.org/en/datasets | CC-0 |
Yesno | Hebrew | 6 mins | one male | http://www.openslr.org/1/ | CC-0 |
LJ Speech Corpus | English | ~24 hours | one female | https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 | CC-0 |
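
The CommonVoice rows above pack two figures into the `# HOURS` cell, a validated count and a total. A minimal Python sketch for pulling both numbers out of such a cell follows. The function names are hypothetical, and treating `<1` as `1` (an upper bound) is an assumption made only for this illustration.

```python
import re

# Matches fragments such as "780 hours (validated)", "1,087 hours (total)"
# or "<1 hour (validated)". A leading "<" is ignored, so "<1" is read as 1.
HOURS_RE = re.compile(r"<?\s*([\d,]+)\s+hours?\s+\((validated|total)\)")


def parse_commonvoice_hours(cell: str) -> dict:
    """Return e.g. {'validated': 780, 'total': 1087} for a '# HOURS' cell."""
    result = {}
    for number, kind in HOURS_RE.findall(cell):
        # Drop thousands separators before converting to int.
        result[kind] = int(number.replace(",", ""))
    return result


def validated_fraction(cell: str) -> float:
    """Share of the recorded hours that have been validated."""
    hours = parse_commonvoice_hours(cell)
    return hours["validated"] / hours["total"]


if __name__ == "__main__":
    english = "780 hours (validated); 1,087 hours (total)"
    print(parse_commonvoice_hours(english))
    print(round(validated_fraction(english), 2))
```

The ratio gives a rough sense of how much of a language's recordings are ready for training; for Breton, for example, only 3 of 10 hours are validated.
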
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
ARU Speech Corpus | English (UK) | 720 utterances / speaker | 12 (6 female; 6 male) | http://datacat.liverpool.ac.uk/681/1/ARU_Speech_Corpus_v1_0.zip | CC-BY 3.0 |
Althingi Parliamentary Speech Corpus | Icelandic | 542 hours and 25 minutes | 196 speakers | http://www.malfong.is/index.php?dlid=73&lang=en | CC-BY 4.0 |
Alþingisumræður Parliamentary Speech Corpus | Icelandic | ~21 hours | | http://www.malfong.is/index.php?dlid=8&lang=en | CC-BY 3.0 |
Hjal Corpus | Icelandic | ~41,000 recordings | 883 speakers | http://www.malfong.is/index.php?dlid=5&lang=en | CC-BY 3.0 |
The Malromur Corpus | Icelandic | 152 hours | 563 speakers | http://www.malfong.is/index.php?dlid=65&lang=en | CC-BY 4.0 |
Telecooperation German Corpus for Kinect | German | ~35 hours | ~180 speakers | http://www.repository.voxforge1.org/downloads/de/german-speechdata-TUDa-2015.tar.gz | CC-BY 2.0 |
African Speech Technology English-English Speech Corpus | English | ~21 hours | | https://repo.sadilar.org/handle/20.500.12185/283 | CC-BY 2.5 South Africa |
African Speech Technology isiXhosa Speech Corpus | isiXhosa | ~26 hours | | https://repo.sadilar.org/handle/20.500.12185/305 | CC-BY 2.5 South Africa |
NCHLT Afrikaans | Afrikaans | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/280 | CC-BY 3.0 |
NCHLT English | English | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/274 | CC-BY 3.0 |
NCHLT isiNdebele | isiNdebele | 56 hours | 148 speakers (78 female / 70 male) | https://repo.sadilar.org/handle/20.500.12185/272 | CC-BY 3.0 |
NCHLT isiXhosa | isiXhosa | 56 hours | 209 speakers (106 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/279 | CC-BY 3.0 |
NCHLT isiZulu | isiZulu | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/275 | CC-BY 3.0 |
NCHLT Sepedi | Sepedi | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/270 | CC-BY 3.0 |
NCHLT Sesotho | Sesotho | 56 hours | 210 speakers (113 female / 97 male) | https://repo.sadilar.org/handle/20.500.12185/278 | CC-BY 3.0 |
NCHLT Setswana | Setswana | 56 hours | 210 speakers (109 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/281 | CC-BY 3.0 |
NCHLT Siswati | Siswati | 56 hours | 197 speakers (96 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/271 | CC-BY 3.0 |
NCHLT Tshivenda | Tshivenda | 56 hours | 208 speakers (83 female / 125 male) | https://repo.sadilar.org/handle/20.500.12185/276 | CC-BY 3.0 |
NCHLT Xitsonga | Xitsonga | 56 hours | 198 speakers (95 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/277 | CC-BY 3.0 |
Lwazi II Cross-lingual Proper Name Corpus | Afrikaans; English; isiZulu; Sesotho | 2 hours 5 mins | 20 speakers | https://repo.sadilar.org/handle/20.500.12185/445 | CC-BY 3.0 |
Lwazi II Proper Name Call Routing Telephone Corpus | English | 2 hours 7 mins | | https://repo.sadilar.org/handle/20.500.12185/448 | CC-BY 3.0 |
Lwazi II Afrikaans Trajectory Tracking Corpus | Afrikaans | 4 hours | one male | https://repo.sadilar.org/handle/20.500.12185/442 | CC-BY 3.0 |
LibriSpeech | English | ~1000 hours | 2484 speakers (1201 female / 1283 male) | http://www.openslr.org/12/ | CC-BY 4.0 |
Zeroth-Korean | Korean | 52.8 hours | 115 speakers | http://www.openslr.org/40/ | CC-BY 4.0 |
Speech Commands | English | 17.8 hours | >1,000 speakers | https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html | CC-BY 4.0 |
ParlamentParla | Catalan | 320 hours | | https://www.openslr.org/59/ | CC-BY 4.0 |
SIWIS | French | ~10 hours | one female | http://datashare.is.ed.ac.uk/download/DS_10283_2353.zip | CC-BY 4.0 |
VCTK | English | 44 hours | 109 speakers | http://datashare.is.ed.ac.uk/download/DS_10283_3443.zip | CC-BY 4.0 |
LibriTTS | English | 586 hours | 2,456 speakers (1,185 female / 1,271 male) | http://www.openslr.org/60/ | CC-BY 4.0 |
Augmented LibriSpeech | Audio (English); Text (English, French) | 236 hours | | https://persyval-platform.univ-grenoble-alpes.fr/DS91/detaildataset | CC-BY 4.0 |
Helsinki Prosody Corpus | English | 262.5 hours | 1,230 speakers | https://github.com/Helsinki-NLP/prosody | CC-BY 4.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
Iban | Iban | 8 hours | | http://www.openslr.org/24/ https://github.com/sarahjuan/iban | CC-BY-SA 2.0 |
Vystadial | English; Czech | 41 hours; 15 hours | | http://www.openslr.org/6/ | CC-BY-SA 3.0 US |
Free Spoken Digit Dataset | English | 2,000 isolated digits | 4 speakers | https://github.com/Jakobovski/free-spoken-digit-dataset | CC-BY-SA 4.0 |
Google Javanese | Javanese | 296 hours | 1019 speakers | http://www.openslr.org/35/ | CC-BY-SA 4.0 |
Google Nepali | Nepali | 165 hours | 527 speakers | http://www.openslr.org/54/ | CC-BY-SA 4.0 |
Google Bengali | Bengali | 229 hours | 508 speakers | http://www.openslr.org/53/ | CC-BY-SA 4.0 |
Google Sinhala | Sinhala | 224 hours | 478 speakers | http://www.openslr.org/52/ | CC-BY-SA 4.0 |
Google Sundanese | Sundanese | 333 hours | 542 speakers | http://www.openslr.org/36/ | CC-BY-SA 4.0 |
Spoken Wikipedia Corpus (SWC-2017) | English; German; Dutch | 182 hours; 249 hours; 79 hours | 395 speakers; 339 speakers; 145 speakers | https://nats.gitlab.io/swc/ | CC-BY-SA 4.0 |
Chuvash TTS | Chuvash | 4 hours | 1 speaker | https://github.com/ftyers/Turkic_TTS | CC-BY-SA 4.0 |
Forschergeist | German | 2 hours | 2 speakers (1 female; 1 male) | female speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/annettevogt-20180320-rec.tgz; male speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/timpritlove-20180320-rec.tgz | CC-BY-SA 4.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
IBM Recorded Debates v1 | English | 5 hours | 10 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |
IBM Recorded Debates v2 | English | ~14 hours | 14 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
TV3Parla | Catalan | 240 hours | | http://laklak.eu/share/tv3_0.3.tar.gz | CC-BY-NC 4.0 |
Russian Open STT Corpus | Russian | ~7000 hours | | https://github.com/snakers4/open_stt/#links | CC-BY-NC 4.0 with some exceptions |
Russian Open TTS Corpus | Russian | 145 hours | 3 males | https://github.com/snakers4/open_tts/#links | CC-BY-NC 4.0 with some exceptions |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
CHiME-Home | English | 6.8 hours | | https://archive.org/details/chime-home | CC-BY-NC-SA 3.0 |
Cameroon Pidgin English Corpus | Cameroon Pidgin English | ~17 hours | | http://ota.ox.ac.uk/text/2563.zip | CC-BY-NC-SA 3.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
Tatoeba-Eng | English | ~250 hours (rough estimate) | 6 speakers | https://voice.mozilla.org/en/datasets | CC BY-NC 4.0 (some audio) / CC BY-NC-ND 3.0 (most audio) / CC BY 2.0 (all text) |
TED-LIUM | English | 118 hours | 685 speakers (36h female / 81h male) | http://www.openslr.org/7/ | CC-BY-NC-ND 3.0 |
TED-LIUM-2 | English | 207 hours | 1242 speakers (66h female / 141h male) | http://www.openslr.org/19/ | CC-BY-NC-ND 3.0 |
TED-LIUM-3 | English | 452 hours | 2028 speakers (134h female / 316h male) | http://www.openslr.org/51/ | CC-BY-NC-ND 3.0 |
Pansori TEDxKR | Korean | 3 hours | 41 speakers | http://www.openslr.org/58/ | CC-BY-NC-ND 4.0 |
Primewords Mandarin | Mandarin | 100 hours | 296 speakers | http://www.openslr.org/47/ | CC-BY-NC-ND 4.0 |
MuST-C v1.0 | Audio (English); Text (Dutch, French, German, Italian, Portuguese, Romanian, Russian, Spanish) | 408, 504, 492, 465, 442, 385, 432, 489 hours per language pair | | https://ict.fbk.eu/must-c-release-v1-0/ | CC-BY-NC-ND 4.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
VoxForge | English | ~120 hours | ~2966 speakers | http://www.repository.voxforge1.org/downloads/en/Trunk/Audio/Main/16kHz_16bit/ https://voice.mozilla.org/en/datasets | GNU-GPL 3.0 |
VoxForge | Russian | | | http://www.repository.voxforge1.org/downloads/ru/Trunk/Audio/Main/16kHz_16bit/ http://www.repository.voxforge1.org/downloads/Russian/Trunk/Audio/Main/16kHz_16bit/ | GNU-GPL 3.0 |
VoxForge | German | | | http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/ | GNU-GPL 3.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
AISHELL-1 | Mandarin | 170 hours | 400 speakers | http://www.openslr.org/33/ | Apache 2.0 |
Tunisian_MSA | Modern Standard Arabic (Tunisia) | 11.2 hours | 118 speakers | http://www.openslr.org/46/ | Apache 2.0 |
African Accented French | French | 22 hours | 232 speakers | http://www.openslr.org/57/ | Apache 2.0 |
THCHS-30 | Mandarin Chinese | 33.57 hours (13,389 utterances) | 40 speakers (31 female; 9 male) | http://www.openslr.org/18/ | Apache 2.0 |
Living Audio Dataset - Dutch | Dutch | 57:49 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |
Living Audio Dataset - English | English | 50:50 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |
Living Audio Dataset - Irish | Irish | 61:56 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |
Living Audio Dataset - Russian | Russian | 34:58 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
ALFFA | Amharic; Hausa (paid); Swahili; Wolof | | | http://www.openslr.org/25/ https://github.com/besacier/ALFFA_PUBLIC | MIT |
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
M-AILABS German Corpus | German | 237 hours and 22 minutes | | http://www.caito.de/data/Training/stt_tts/de_DE.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
M-AILABS Queen's English Corpus | Queen's English | 45 hours and 35 minutes | | http://www.caito.de/data/Training/stt_tts/en_UK.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
M-AILABS US English Corpus | American English | 102 hours and 7 minutes | | http://www.caito.de/data/Training/stt_tts/en_US.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
M-AILABS Spanish Corpus | Spanish (Spain) | 108 hours and 34 minutes | | http://www.caito.de/data/Training/stt_tts/es_ES.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
M-AILABS Italian Corpus | Italian | 127 hours and 40 minutes | | http://www.caito.de/data/Training/stt_tts/it_IT.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
M-AILABS Ukrainian Corpus | Ukrainian | 87 hours and 8 minutes | | http://www.caito.de/data/Training/stt_tts/uk_UK.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
M-AILABS Russian Corpus | Russian | 46 hours and 47 minutes | | http://www.caito.de/data/Training/stt_tts/ru_RU.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
M-AILABS French-v0.9 Corpus | French | 190 hours and 30 minutes | | http://www.caito.de/data/Training/stt_tts/fr_FR.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
M-AILABS Polish Corpus | Polish | 53 hours and 50 minutes | | http://www.caito.de/data/Training/stt_tts/pl_PL.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
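
Because the LICENSE column decides whether a corpus can be used commercially, it can be handy to filter the tables above by license. The following is a minimal Python sketch, not an official tool for this list. It assumes each row has exactly six pipe-separated cells in the column order of the table headers, and it uses a simple heuristic (a Creative Commons `NC` element means non-commercial). The names `parse_row`, `allows_commercial_use` and `commercially_usable` are hypothetical.

```python
import re

COLUMNS = ["corpus", "languages", "hours", "speakers", "download", "license"]


def parse_row(line: str) -> dict:
    """Split one pipe-delimited table row into a dict keyed by COLUMNS."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} cells, got {len(cells)}: {line!r}")
    return dict(zip(COLUMNS, cells))


def allows_commercial_use(license_name: str) -> bool:
    """Heuristic (assumption): any 'NC' element marks a non-commercial license."""
    tokens = re.split(r"[\s\-]+", license_name.upper())
    return "NC" not in tokens


def commercially_usable(table_text: str) -> list:
    """Return corpus names whose license has no NonCommercial element."""
    names = []
    for line in table_text.splitlines():
        # Skip blank lines, the header row, and the ---|--- delimiter row.
        if line.startswith("CORPUS") or set(line.replace("|", "").strip()) <= {"-"}:
            continue
        row = parse_row(line)
        if allows_commercial_use(row["license"]):
            names.append(row["corpus"])
    return names


SAMPLE = """\
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
Yesno | Hebrew | 6 mins | one male | http://www.openslr.org/1/ | CC-0 |
LibriSpeech | English | ~1000 hours | 2484 speakers (1201 female / 1283 male) | http://www.openslr.org/12/ | CC-BY 4.0 |
TED-LIUM | English | 118 hours | 685 speakers (36h female / 81h male) | http://www.openslr.org/7/ | CC-BY-NC-ND 3.0 |
AISHELL-1 | Mandarin | 170 hours | 400 speakers | http://www.openslr.org/33/ | Apache 2.0 |
"""

if __name__ == "__main__":
    print(commercially_usable(SAMPLE))  # ['Yesno', 'LibriSpeech', 'AISHELL-1']
```

The heuristic only looks at the license name, so entries such as "CC-BY-NC 4.0 with some exceptions" still need a manual read of the actual license terms.
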