NLP

A collection of resources for natural language processing, mostly links to datasets for machine learning approaches.

Datasets

Mozilla Common Voice
Link: https://commonvoice.mozilla.org/en/datasets
Content:
- 2185 hours validated English (of 2886 hours total) by 79398 voices
- 1062 hours validated German (of 1133 hours total) by 16390 voices
- 2000 hours validated Kinyarwanda (of 2383 hours total) by 1055 voices
- 826 hours validated French (of 902 hours total) by 16082 voices
- 404 hours validated Spanish (of 739 hours total) by 22741 voices
- 162 hours validated Russian (of 193 hours total) by 2452 voices
- 310 hours validated Italian (of 335 hours total) by 6576 voices
- and many more languages
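The validated share of each language follows directly from the figures above. A minimal sketch in Python, assuming the hour counts are copied by hand from this list (the dictionary and function names are illustrative and not part of any Common Voice tooling):

```python
# (validated hours, total hours) per language, copied from the list above.
COMMON_VOICE_HOURS = {
    "English": (2185, 2886),
    "German": (1062, 1133),
    "Kinyarwanda": (2000, 2383),
    "French": (826, 902),
    "Spanish": (404, 739),
    "Russian": (162, 193),
    "Italian": (310, 335),
}


def validated_share(validated: int, total: int) -> float:
    """Return the fraction of recorded hours that has been validated."""
    return validated / total


def rank_by_validated_share(hours: dict) -> list:
    """Return language names sorted from highest to lowest validated share."""
    return sorted(
        hours,
        key=lambda lang: validated_share(*hours[lang]),
        reverse=True,
    )


if __name__ == "__main__":
    for lang in rank_by_validated_share(COMMON_VOICE_HOURS):
        validated, total = COMMON_VOICE_HOURS[lang]
        print(f"{lang}: {validated_share(validated, total):.1%} validated")
```

Among the languages listed, Spanish has the lowest validated share (404 of 739 hours), which may matter when only validated clips are used for training.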
LJ Speech Dataset
Link: https://keithito.com/LJ-Speech-Dataset/
Content:
- 24 hours English by 1 voice
CSTR VCTK Corpus
Link: https://datashare.ed.ac.uk/handle/10283/3443
Content:
- ~400 English sentences each by 110 voices
LibriVox
Link: https://librivox.org
Content:
- unknown number of voices
- 33270 books English
- 2649 books German
- 868 books French
- 742 books Spanish
- 261 books Italian
- 430 books Chinese
LibriSpeech
Link: https://www.openslr.org/12
Content:
- extracted from LibriVox (see above)
- ~1000 hours English
- ~585 hours in higher quality at https://openslr.org/60/
VoxForge
Link: http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/
Content:
- collection of speech files / transcripts by unknown number of voices
- 6319 files English
- 2260 files French
- 1419 files German
- 1060 files Italian
- 630 files Russian
- 2248 files Spanish
- and some more
TED-LIUM
Link: https://www.openslr.org/51/
Content:
- 452 hours by unknown number of voices
Thorsten Müller
Link: https://www.openslr.org/110/
Content:
- 300 phrases in 8 different emotions
- ~3 hours German by 1 voice
Emotional Voices Database
Link: https://www.openslr.org/115/
Content:
- collection of audio with 3-5 different emotions
- ~7000 files English by 4 voices
Tatoeba
Link: https://tatoeba.org/en/downloads
Content:
- sentences with audio files by unknown number of voices
- 692348 sentences English
- 113008 sentences Spanish
- 32951 sentences German
- 8173 sentences French
- 7575 sentences Russian
- 1747 sentences Mandarin Chinese
- 1529 sentences Japanese
- 198 sentences Italian
- and many more
Spoken Wikipedia Corpora
Link: https://nats.gitlab.io/swc/
Content:
- spoken Wikipedia articles
- 386 hours German by 339 voices
- 395 hours English by 395 voices
- 224 hours Dutch by 145 voices
M-AILABS Speech Dataset
Link: https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/
Content:
- mostly extracted from LibriVox (see above)
- 237 hours German
- 45 hours British English
- 102 hours American English
- 108 hours Spanish
- 127 hours Italian
- 87 hours Ukrainian
- 46 hours Russian
- 190 hours French
- 53 hours Polish
- contains mixed data, i.e. female and male speakers
VCTK Noisy Speech Database
Link: https://datashare.ed.ac.uk/handle/10283/2791
Content:
- noisy and clean audio files by up to 56 voices
- includes written transcripts
- unknown number of hours
American English Speech Corpus
Link: https://www.magicdatatech.com/datasets/mdt-tts-e018-american-english-speech-corpus-for-tts-1631179203
Content:
- ~2 hours American English by 1 female voice
American Male Voice Dataset
Link: https://www.magicdatatech.com/datasets/mdt-tts-e009-american-male-voice-tts-dataset
Content:
- 15 hours American English by 1 male voice
Facebook VoxPopuli
Link: https://github.com/facebookresearch/voxpopuli
Content:
- download instructions in the README of the repository
- in 16 European languages including English, German, French and Spanish
- 1800 hours transcribed audio by unknown number of voices
Multilingual LibriSpeech
Link: https://openslr.org/94/
Content:
- unclear whether transcripts are provided
- extracted from LibriVox (see above)
Kensho SPGI Speech
Link: https://datasets.kensho.com/datasets/spgispeech
Content:
- transcribed company earnings calls
- ~5000 hours international business English by ~50000 voices
Free Spoken Digit Dataset
Link: https://github.com/Jakobovski/free-spoken-digit-dataset
Content:
- 3000 recordings English by 6 voices
- 50 recordings per digit per voice
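The counts above are consistent: 6 voices x 10 digits x 50 recordings = 3000. A minimal sketch for labelling the files, assuming the `{digit}_{speaker}_{index}.wav` naming scheme described in the repository README (for example `7_jackson_32.wav`). Check the README of the version you download; the names `Recording`, `parse_name` and `expected_total` are illustrative and not part of the dataset.

```python
import re
from typing import NamedTuple


class Recording(NamedTuple):
    digit: int    # spoken digit, 0-9
    speaker: str  # speaker name as it appears in the file name
    index: int    # running index of the recording for this digit and speaker


# Assumed file name layout: <digit>_<speaker>_<index>.wav
_NAME_PATTERN = re.compile(r"^(\d)_([A-Za-z0-9]+)_(\d+)\.wav$")


def parse_name(filename: str) -> Recording:
    """Split a file name into digit label, speaker and recording index."""
    match = _NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return Recording(int(match.group(1)), match.group(2), int(match.group(3)))


def expected_total(voices: int = 6, digits: int = 10, per_digit: int = 50) -> int:
    """Number of recordings implied by the counts listed above."""
    return voices * digits * per_digit


if __name__ == "__main__":
    print(parse_name("7_jackson_32.wav"))
    print(expected_total())
```

Parsing the label from the file name avoids the need for a separate metadata file when building a training set.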
Flickr Audio Captions Corpus
Link: https://groups.csail.mit.edu/sls/downloads/flickraudio/index.cgi
Content:
- 40000 spoken English captions for 8000 images
- original text captions available at https://www.kaggle.com/adityajn105/flickr8k
