asr audio datasets noise speaker speech tts

Audio/Speech Datasets

A list of Audio/Speech datasets for Speech Recognition, Speech Synthesis, Noise, Audio Tagging/Sound Event Detection, Speaker Diarization, Speaker Recognition, (Inverse) Text Normalization, Speech Translation, Multilingual, etc. (continuously updated)

Overview

Task
- ASR
- TTS
- Noise
- Audio/Sound
- SD
- SR
- TN/ITN
- ST
Language
- chinese
- english
- other

Task

Speech Recognition

chinese

Name	Duration(hours)	Links	Comments
THCHS-30	30	[SLR18]	train: 30 speakers, 10893 utterances; test: 10 speakers, 2496 utterances
Aishell	179	[SLR33]	400 speakers
Aishell2	1000	[Website]	if available, 1991 speakers
Free ST Chinese Mandarin (ST-CMDS)	110	[SLR38]	855 speakers, 102600 utterances
Primewords Chinese Corpus Set 1	99	[SLR47]	296 native Chinese speakers
aidatatang_200zh	200	[SLR62]	600 speakers
aidatatang_1505zh	1505	[Github]	if available
MAGICDATA Mandarin Read	755	[SLR68]	1080 speakers
MAGICDATA Mandarin Conversational (RAMC)	180	[SLR123]	663 speakers
AliMeeting (M2MeT)	118.75 (train/dev/test 104.75/4/10)	[SLR119]	ASR, SD
WenetSpeech	10000+	[SLR121] [Github] [Website]
TAL-ASR	100	[Website]	80+ speakers
TAL-CSASR	587	[Website]	code-switching, 200+ speakers
didispeech			if available
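
Many entries above link to OpenSLR through labels such as [SLR18]. The following is a minimal sketch of how a tab-separated row from this table could be parsed and its SLR labels resolved to resource pages. It assumes OpenSLR pages follow the pattern `https://www.openslr.org/<number>/`; the helper names `slr_to_url` and `parse_row` are hypothetical and exist only for this illustration.

```python
import re

# Assumption: OpenSLR resource pages live at https://www.openslr.org/<number>/
OPENSLR_BASE = "https://www.openslr.org"


def slr_to_url(label: str) -> str:
    """Turn a label such as '[SLR18]' or 'SLR18' into an OpenSLR page URL."""
    match = re.fullmatch(r"\[?SLR(\d+)\]?", label.strip())
    if match is None:
        raise ValueError(f"not an OpenSLR label: {label!r}")
    return f"{OPENSLR_BASE}/{int(match.group(1))}/"


def parse_row(line: str) -> dict:
    """Split one tab-separated table row into its four named columns."""
    cells = line.rstrip("\n").split("\t")
    cells += [""] * (4 - len(cells))  # pad rows that omit trailing columns
    name, hours, links, comments = cells[:4]
    return {
        "name": name,
        "hours": hours,
        "links": links.split(),  # e.g. ['[SLR121]', '[Github]', '[Website]']
        "comments": comments,
    }


def openslr_urls(row: dict) -> list:
    """Return URLs for the SLR labels in a parsed row, skipping other links."""
    return [slr_to_url(link) for link in row["links"] if link.startswith("[SLR")]


if __name__ == "__main__":
    row = parse_row("WenetSpeech\t10000+\t[SLR121] [Github] [Website]")
    print(row["name"], openslr_urls(row))
```

Labels such as [Github] and [Website] are left unresolved because their targets are not recoverable from the label alone.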

english

Name	Duration(hours)	Links	Comments
LibriSpeech	1000	[SLR12] [LM]
GigaSpeech	33,000+ (unsupervised); 10,000 (supervised)	[Github]
Multilingual LibriSpeech (MLS)		[SLR94]	Multilingual
libri-light	60,000 (unlabelled)	[Github]	pretraining, unsupervised, semi-supervised
libriheavy	50,000	[Github]	casing, punctuation, context
Spgispeech
People's Speech

Speech Synthesis

chinese

Name	Duration(hours)	Links	Comments
AISHELL-3	85	[Website]	44.1 kHz, 218 native Chinese speakers, 88035 utterances
LibriTTS

Noise

Name	Duration(hours)	Links	Comments
MUSAN		[SLR17]
Aachen Impulse Response database (AIR)		[SLR20]
Simulated Room Impulse Response Database		[SLR26]
Room Impulse Response and Noise Database		[SLR28]

Audio Tagging/Sound Event Detection

Speaker Diarization

Name	Duration(hours)	Links	Comments
AliMeeting (M2MeT)	118.75 (train/dev/test 104.75/4/10)	[SLR119]	ASR, SD

Speaker Recognition

(Inverse) Text normalization

Speech Translation

GigaST

GigaS2S

Reference

About

https://github.com/weimeng23/audio-speech-datasets

Creative Commons Attribution Share Alike 4.0 International