A list of open(ish) corpora for Automatic Speech Recognition research and development.
This list has a preference for free (i.e. no $ cost) and truly open corpora (i.e. some kind of CC license).
Not every corpus listed here meets both criteria, but all of them are accessible and usable for research and/or commercial purposes. Some paid corpora with restrictive licenses (e.g. from the LDC) may be included, given their wide use in research and industry.
Feel free to propose additions to the list!
CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
---|---|---|---|---|---|
CommonVoice-1 | English | ~500 hours | | https://voice.mozilla.org/en/datasets | CC-0 |
Yesno | Hebrew | 6 mins | one male | http://www.openslr.org/1/ | CC-0 |
NCHLT Afrikaans | Afrikaans | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/280 | CC-BY 3.0 |
NCHLT English | English | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/274 | CC-BY 3.0 |
NCHLT isiNdebele | isiNdebele | 56 hours | 148 speakers (78 female / 70 male) | https://repo.sadilar.org/handle/20.500.12185/272 | CC-BY 3.0 |
NCHLT isiXhosa | isiXhosa | 56 hours | 209 speakers (106 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/279 | CC-BY 3.0 |
NCHLT isiZulu | isiZulu | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/275 | CC-BY 3.0 |
NCHLT Sepedi | Sepedi | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/270 | CC-BY 3.0 |
NCHLT Sesotho | Sesotho | 56 hours | 210 speakers (113 female / 97 male) | https://repo.sadilar.org/handle/20.500.12185/278 | CC-BY 3.0 |
NCHLT Setswana | Setswana | 56 hours | 210 speakers (109 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/281 | CC-BY 3.0 |
NCHLT Siswati | Siswati | 56 hours | 197 speakers (96 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/271 | CC-BY 3.0 |
NCHLT Tshivenda | Tshivenda | 56 hours | 208 speakers (83 female / 125 male) | https://repo.sadilar.org/handle/20.500.12185/276 | CC-BY 3.0 |
NCHLT Xitsonga | Xitsonga | 56 hours | 198 speakers (95 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/277 | CC-BY 3.0 |
Lwazi II Cross-lingual Proper Name Corpus | Afrikaans; English; isiZulu; Sesotho | 2 hours 5 mins | 20 speakers | https://repo.sadilar.org/handle/20.500.12185/445 | CC-BY 3.0 |
Lwazi II Proper Name Call Routing Telephone Corpus | English | 2 hours 7 mins | | https://repo.sadilar.org/handle/20.500.12185/448 | CC-BY 3.0 |
Lwazi II Afrikaans Trajectory Tracking Corpus | Afrikaans | 4 hours | one male | https://repo.sadilar.org/handle/20.500.12185/442 | CC-BY 3.0 |
LibriSpeech | English | ~1000 hours | 2484 speakers (1201 female / 1283 male) | http://www.openslr.org/12/ | CC-BY 4.0 |
Zeroth-Korean | Korean | 52.8 hours | 115 speakers | http://www.openslr.org/40/ | CC-BY 4.0 |
Iban | Iban | 8 hours | | http://www.openslr.org/24/ https://github.com/sarahjuan/iban | CC-BY-SA 2.0 |
Vystadial | English; Czech | 41 hours; 15 hours | | http://www.openslr.org/6/ | CC-BY-SA 3.0 US |
Free Spoken Digit Dataset | English | 2,000 isolated digits | 4 speakers | https://github.com/Jakobovski/free-spoken-digit-dataset | CC-BY-SA 4.0 |
Google Javanese | Javanese | | | http://www.openslr.org/35/ | CC-BY-SA 4.0 |
Google Nepali | Nepali | | | http://www.openslr.org/54/ | CC-BY-SA 4.0 |
Google Bengali | Bengali | | | http://www.openslr.org/53/ | CC-BY-SA 4.0 |
Google Sinhala | Sinhala | | | http://www.openslr.org/52/ | CC-BY-SA 4.0 |
Google Sundanese | Sundanese | | | http://www.openslr.org/36/ | CC-BY-SA 4.0 |
SWC-2017 | English; German; Dutch | 182 hours; 249 hours; 79 hours | 395 speakers; 339 speakers; 145 speakers | https://nats.gitlab.io/swc/ | CC-BY-SA 4.0 |
IBM Recorded Debates v1 | English | 5 hours | 10 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |
IBM Recorded Debates v2 | English | ~14 hours | 14 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |
Tatoeba-Eng | English | hours | 6 speakers | https://voice.mozilla.org/en/datasets | CC BY-NC 4.0 (most audio) / CC BY-NC-ND 3.0 (some audio) / CC BY 2.0 (all text) |
CHiME-Home | English | 6.8 hours | | https://archive.org/details/chime-home | CC-BY-NC-SA 3.0 |
TED-LIUM | English | 118 hours | 685 speakers (36h female / 81h male) | http://www.openslr.org/7/ | CC-BY-NC-ND 3.0 |
TED-LIUM-2 | English | 207 hours | 1242 speakers (66h female / 141h male) | http://www.openslr.org/19/ | CC-BY-NC-ND 3.0 |
TED-LIUM-3 | English | 452 hours | 2028 speakers (134h female / 316h male) | http://www.openslr.org/51/ | CC-BY-NC-ND 3.0 |
Pansori TEDxKR | Korean | 3 hours | 41 speakers | http://www.openslr.org/58/ | CC-BY-NC-ND 4.0 |
Primewords Mandarin | Mandarin | 100 hours | 296 speakers | http://www.openslr.org/47/ | CC-BY-NC-ND 4.0 |
VoxForge | English | ~120 hours | ~2966 speakers | http://www.voxforge.org/home/downloads https://voice.mozilla.org/en/datasets | GPL 3.0 |
AISHELL-1 | Mandarin | 170 hours | 400 speakers | http://www.openslr.org/33/ | Apache 2.0 |
Tunisian_MSA | Modern Standard Arabic (Tunisia) | 11.2 hours | 118 speakers | http://www.openslr.org/46/ | Apache 2.0 |
African Accented French | French | 22 hours | 232 speakers | http://www.openslr.org/57/ | Apache 2.0 |
ALFFA | Amharic; Hausa (paid); Swahili; Wolof | | | http://www.openslr.org/25/ https://github.com/besacier/ALFFA_PUBLIC | MIT |
CMU Wilderness | 700 languages | Alignments distributed without audio or text; total: ~14,000 hours; per language: ~20 hours | | https://github.com/festvox/datasets-CMU_Wilderness | Questionable legality: https://live.bible.is/terms |
CHiME-5 | English | 50 hours | 48 speakers | http://spandh.dcs.shef.ac.uk/chime_challenge/data.html | CHiME-5 License |
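
Because the list mixes open, non-commercial, and custom licenses, it can help to filter it programmatically. The following is a minimal sketch, not part of the list itself: it assumes rows are kept in the pipe-delimited layout used above, and the names `parse_row`, `has_nc_clause`, and `select` are hypothetical helpers introduced only for illustration. The NC check is a plain string heuristic over the LICENSE column, not a legal assessment; always read the actual license before using a corpus.

```python
import re

COLUMNS = ["corpus", "languages", "hours", "speakers", "download", "license"]

# A few rows copied from the table above, in the same pipe-delimited layout.
TABLE = """\
Yesno | Hebrew | 6 mins | one male | http://www.openslr.org/1/ | CC-0 |
LibriSpeech | English | ~1000 hours | 2484 speakers (1201 female / 1283 male) | http://www.openslr.org/12/ | CC-BY 4.0 |
TED-LIUM | English | 118 hours | 685 speakers (36h female / 81h male) | http://www.openslr.org/7/ | CC-BY-NC-ND 3.0 |
AISHELL-1 | Mandarin | 170 hours | 400 speakers | http://www.openslr.org/33/ | Apache 2.0 |
"""


def parse_row(line):
    """Split one pipe-delimited table row into a dict keyed by COLUMNS."""
    cells = [cell.strip() for cell in line.strip().split("|")]
    # A trailing "|" leaves one empty string at the end; drop it.
    if cells and cells[-1] == "":
        cells.pop()
    # Pad short rows so every record has all six columns.
    cells += [""] * (len(COLUMNS) - len(cells))
    return dict(zip(COLUMNS, cells))


def has_nc_clause(license_text):
    """Heuristic: True if the license string contains a standalone 'NC' token."""
    return re.search(r"\bNC\b", license_text) is not None


def select(rows, language=None, exclude_nc=False):
    """Return rows matching a language and, optionally, lacking an NC clause."""
    picked = []
    for row in rows:
        languages = [name.strip() for name in row["languages"].split(";")]
        if language is not None and language not in languages:
            continue
        if exclude_nc and has_nc_clause(row["license"]):
            continue
        picked.append(row)
    return picked


rows = [parse_row(line) for line in TABLE.splitlines() if line.strip()]
for row in select(rows, language="English", exclude_nc=True):
    print(row["corpus"], "|", row["license"])
```

With the four sample rows above, this prints only the LibriSpeech entry, since TED-LIUM carries an NC clause and the other two rows are not English.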