Chribit / NLP

A collection of resources for natural language processing. Mostly links to datasets for machine learning approaches.

Datasets

  1. Mozilla Common Voice
    Link: https://commonvoice.mozilla.org/en/datasets
    Content:

    • 2185 hours validated English (of 2886 hours total) by 79398 voices
    • 1062 hours validated German (of 1133 hours total) by 16390 voices
    • 2000 hours validated Kinyarwanda (of 2383 hours total) by 1055 voices
    • 826 hours validated French (of 902 hours total) by 16082 voices
    • 404 hours validated Spanish (of 739 hours total) by 22741 voices
    • 162 hours validated Russian (of 193 hours total) by 2452 voices
    • 310 hours validated Italian (of 335 hours total) by 6576 voices
    • and many more languages
  2. LJ Speech Dataset
    Link: https://keithito.com/LJ-Speech-Dataset/
    Content:

    • 24 hours English by 1 voice
  3. CSTR VCTK Corpus
    Link: https://datashare.ed.ac.uk/handle/10283/3443
    Content:

    • ~400 sentences English each by 110 voices
  4. LibriVox
    Link: https://librivox.org
    Content:

    • unknown number of voices
    • 33270 books English
    • 2649 books German
    • 868 books French
    • 742 books Spanish
    • 261 books Italian
    • 430 books Chinese
  5. LibriSpeech
    Link: https://www.openslr.org/12
    Content:

  6. VoxForge
    Link: http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/
    Content:

    • collection of speech files / transcripts by unknown number of voices
    • 6319 files English
    • 2260 files French
    • 1419 files German
    • 1060 files Italian
    • 630 files Russian
    • 2248 files Spanish
    • and some more
  7. TED-LIUM
    Link: https://www.openslr.org/51/
    Content:

    • 452 hours by unknown number of voices
  8. Thorsten Müller
    Link: https://www.openslr.org/110/
    Content:

    • 300 phrases in 8 different emotions
    • ~3 hours German by 1 voice
  9. Emotional Voices Database
    Link: https://www.openslr.org/115/
    Content:

    • collection of audio with 3-5 different emotions
    • ~7000 files English by 4 voices
  10. Tatoeba
    Link: https://tatoeba.org/en/downloads
    Content:

    • sentences with audio files by unknown number of voices
    • 692348 sentences English
    • 113008 sentences Spanish
    • 32951 sentences German
    • 8173 sentences French
    • 7575 sentences Russian
    • 1747 sentences Mandarin Chinese
    • 1529 sentences Japanese
    • 198 sentences Italian
    • and many more
  11. Spoken Wikipedia Corpora
    Link: https://nats.gitlab.io/swc/
    Content:

    • spoken Wikipedia articles
    • 386 hours German by 339 voices
    • 395 hours English by 395 voices
    • 224 hours Dutch by 145 voices
  12. M-AILABS Speech Dataset
    Link: https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/
    Content:

    • mostly extracted from LibriVox (see 4.)
    • 237 hours German
    • 45 hours British English
    • 102 hours American English
    • 108 hours Spanish
    • 127 hours Italian
    • 87 hours Ukrainian
    • 46 hours Russian
    • 190 hours French
    • 53 hours Polish
    • contains mixed data, i.e. female and male speakers
  13. VCTK Noisy Speech Database
    Link: https://datashare.ed.ac.uk/handle/10283/2791
    Content:

    • noisy and clean audio files by up to 56 voices
    • includes written transcripts
    • unknown number of hours
  14. American English Speech Corpus
    Link: https://www.magicdatatech.com/datasets/mdt-tts-e018-american-english-speech-corpus-for-tts-1631179203
    Content:

    • ~2 hours American English by 1 female voice
  15. American Male Voice Dataset
    Link: https://www.magicdatatech.com/datasets/mdt-tts-e009-american-male-voice-tts-dataset
    Content:

    • 15 hours American English by 1 male voice
  16. Facebook VoxPopuli
    Link: https://github.com/facebookresearch/voxpopuli
    Content:

    • download instructions in README of repository
    • in 16 European languages, including English, German, French and Spanish
    • 1800 hours transcribed audio by unknown number of voices
  17. Multilingual LibriSpeech
    Link: https://openslr.org/94/
    Content:

    • unclear if transcripts provided
    • extracted from LibriVox (see 4.)
  18. Kensho SPGI Speech
    Link: https://datasets.kensho.com/datasets/spgispeech
    Content:

    • transcribed company earnings calls
    • ~5000 hours international business English by ~50000 voices
  19. Free Spoken Digit Dataset
    Link: https://github.com/Jakobovski/free-spoken-digit-dataset
    Content:

    • 3000 recordings English by 6 voices
    • 50 recordings per digit per voice
  20. Flickr Audio Captions Corpus
    Link: https://groups.csail.mit.edu/sls/downloads/flickraudio/index.cgi
    Content:
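
Comparing entries

When choosing between the languages listed under Mozilla Common Voice (see 1.), the share of validated hours and the hours available per voice may be more telling than the raw totals. The following is a minimal sketch that computes both from the figures quoted in this list. The dictionary and function names are illustrative only and are not part of any dataset tooling, and the numbers are a snapshot copied from the entry above, so they may not match the current release.

```python
# Figures copied from the Mozilla Common Voice entry in this list:
# language -> (validated hours, total hours, voices).
# These are a snapshot and may be out of date.
COMMON_VOICE = {
    "English": (2185, 2886, 79398),
    "German": (1062, 1133, 16390),
    "Kinyarwanda": (2000, 2383, 1055),
    "French": (826, 902, 16082),
    "Spanish": (404, 739, 22741),
    "Russian": (162, 193, 2452),
    "Italian": (310, 335, 6576),
}


def validated_share(validated_hours, total_hours):
    """Return the fraction of recorded hours that has been validated."""
    return validated_hours / total_hours


def summarize(stats):
    """Return (language, validated share, validated hours per voice),
    sorted by validated share in descending order."""
    rows = []
    for language, (validated, total, voices) in stats.items():
        rows.append((language, validated_share(validated, total), validated / voices))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for language, share, hours_per_voice in summarize(COMMON_VOICE):
        print(f"{language:<12} {share:6.1%} validated, {hours_per_voice:.3f} h per voice")
```

A high validated share with few voices (as for Kinyarwanda in the figures above) points to a small number of speakers contributing many hours each, which may matter for speaker-independent models.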
