
# SpeechTasks

This is a list of speech tasks and datasets, which can provide training data for Generative AI, AIGC, AI model training, intelligent speech tool development, and speech applications.

Continuously Updating!

I will add new tasks and datasets to this repo continuously.

You are welcome to open an issue or email me at hwang258@jhu.edu to point out any unlisted tasks or datasets!

## Table of Contents

  1. Tasks and DataSets
  2. Tasks with Different Input & Output Modes
  3. Tasks with Different Levels

## Tasks and DataSets

| Task | DataSets | Input Mode | Output Mode | Modeling Target | Level | Description |
| --- | --- | --- | --- | --- | --- | --- |
| Accent Classification | AccentDB Extended Dataset | Audio | Label | Classification | Acoustic, Language | Accent classification involves the recognition and classification of specific speech accents. The possible answers include American, Australian, Bangla, British, Indian, Malayalam, Odiya, Telugu, or Welsh. The objective is to correctly identify these accents from the given speech samples, contributing to a system's ability to understand and interact with various speakers. |
| Accented Text-to-speech | L2-ARCTIC | Text, Audio | Audio | Generation | Acoustic, Language | Accented text-to-speech (TTS) synthesis aims to synthesize speech with a given foreign accent instead of native speech. |
| Acoustic Echo Cancellation | AEC Challenge | Audio | Audio | Regression | Acoustic | Acoustic echo cancellation removes echoes, reverberation, and other unwanted added sounds from a signal that passes through an acoustic space. |
| Automatic Speech Recognition | LibriSpeech<br>Common Voice<br>VoxPopuli<br>MLS<br>Libri-light<br>AISHELL<br>GigaSpeech<br>CoVoST<br>Libriheavy<br>TED-LIUM<br>TIMIT<br>WenetSpeech | Audio | Text | Classification | Content | Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, enables a program to process human speech into a written format. |
| DeepFake Detection (Spoof Detection) | ASVspoof 2015 Dataset<br>ASVspoof 2017 Dataset<br>ASVspoof2019<br>ASVspoof2021<br>ASVspoof5<br>ADD Challenge<br>In-the-Wild<br>WaveFake<br>SingFake | Audio | Binary Label | Binary Classification | Acoustic | Audio deepfake detection aims to distinguish genuine utterances from fake ones via machine learning techniques. |
| Dialogue Act Classification | DailyTalk Dataset | Audio | Label | Classification | Understanding | Dialogue act classification aims to identify the primary purpose or function of an utterance within its dialogue context. The possible answers could be question, inform, directive, or commissive. These identification tasks matter because dialogue acts are central to understanding human conversation and dialogue-based AI system communication. |
| Dialogue Act Pairing | DailyTalk Dataset | Audio, Label | Binary Label | Binary Classification | Understanding | Dialogue act pairing assesses the congruence of dialogue acts, that is, whether a response dialogue act is appropriate given a query dialogue act. The answer is either true or false. Accurately judging the appropriateness of dialogue acts is key for a universal speech model to understand and participate in human conversations effectively. |
| Dialogue Emotion Classification | DailyTalk Dataset | Audio | Label | Classification | Emotion | Dialogue emotion classification assesses a model's ability to identify the emotion communicated in a given dialogue extract. Possible answers include anger, disgust, fear, sadness, happiness, surprise, or no emotion. The task evaluates the model's capacity to interpret emotions conveyed through speech, accounting for both linguistic content and paralinguistic indicators. |
| Dysarthric Speech Assessments | UASpeech<br>TORGO | Audio | Scalar | Regression | Acoustic | Dysarthric speech assessments of speech intelligibility are conducted to check a patient's status and track the effectiveness of treatments. |
| Dysarthric Speech Recognition | UASpeech<br>TORGO | Audio | Text | Classification | Content | Dysarthric speech recognition aims to transcribe dysarthric speech, which results from motor speech disorders caused by conditions such as Parkinson's disease or amyotrophic lateral sclerosis (ALS). |
| Emotion Recognition | Multimodal EmotionLines Dataset<br>IEMOCAP<br>MELD<br>CREMA-D<br>MSP-Podcast<br>SAVEE<br>MESD<br>CMU-MOSEI<br>MEAD | Audio | Label | Classification | Emotion | Emotion recognition aims to identify the most appropriate emotional category for a given utterance. While emotion can sometimes be identified from linguistic content alone, the more telling cues often lie in paralinguistic features such as pitch, rhythm, and other prosodic elements. Understanding these features is crucial for a universal speech model, as they distinguish speech from mere text. |
| Emotional TTS | RAVDESS<br>EMOV-DB<br>LJSpeech Dataset<br>IEMOCAP | Text, Label | Audio | Generation | Acoustic, Emotion | Emotional text-to-speech (TTS) aims to synthesize speech with specific emotion types. |
| Enhancement Detection | LibriTTS-TestClean | Audio | Binary Label | Binary Classification | Acoustic | Enhancement detection determines whether a given audio clip has been created or modified by a speech enhancement model. The expected answer is either yes or no. The task is challenging because the model must not only process the content of the speech but also detect the minute modifications that indicate enhancement. |
| Expressive TTS | Expresso | Text, Label | Audio | Generation | Acoustic, Understanding | Expressive text-to-speech (TTS) aims to synthesize speech with specific reading or improvised styles. |
| HowFarAreYou | 3DSpeaker Dataset<br>Spatial LibriSpeech | Audio | Scalar | Regression | Acoustic | The HowFarAreYou task estimates the speaker's distance from the recording point based on the provided audio. The response is an exact value, such as 0.4 m, 2.0 m, or 4.0 m. Gauging the speaker's distance provides insight into the audio's spatial characteristics, a crucial aspect of auditory scene analysis. |
| Instruct TTS | None available | Text | Audio | Generation | Acoustic, Understanding | Instruct text-to-speech (TTS) aims to synthesize speech with varying speaking styles, following a given instruction, to better reflect human speech patterns. |
| Intent Classification | FluentSpeechCommands Dataset<br>SLURP<br>ATIS<br>Snips | Audio | Label | Classification | Understanding | Intent classification aims to identify the actionable item behind a spoken message. Recognized intents can include activate, bring, change language, deactivate, decrease, or increase. Accurate intent identification is pivotal for reliable speech-based applications and interfaces. The task is categorized into three types: Action, Location, and Object. |
| Keyword Spotting | Google Speech Commands V1 Dataset<br>LibriPhrase | Audio, Text | Binary Label | Binary Classification | Content | Keyword spotting detects keywords or phrases in phone calls or audio recordings. The detected words and phrases can then be used to adjust the urgency of a call, train employees, and gauge customer satisfaction. |
| Language Identification | VoxForge Dataset<br>Common Voice<br>VoxLingua107 | Audio | Label | Classification | Language | Language identification determines the language spoken in a given recording, such as German, English, Spanish, Italian, Russian, or French. It is an essential part of speech processing, as it facilitates understanding and translation across languages. |
| Laughter Synthesis | Laughterscape | Audio, Audio | Audio | Generation | Acoustic | Laughter synthesis aims to generate the sound of a given speaker's laughter. |
| Multilingual Speech Recognition | Common Voice<br>VoxLingua107<br>MLS<br>FLEURS<br>CMU Wilderness<br>YODAS | Audio | Text | Classification | Content, Language | Multilingual speech recognition (MSR) develops systems that accurately transcribe speech across multiple languages. Unlike traditional speech recognition systems designed for a specific language, MSR systems aim to handle diverse languages and dialects. |
| MultiSpeaker Detection | LibriSpeech-TestClean Dataset<br>VCTK Dataset | Audio | Binary Label | Binary Classification | Speaker | MultiSpeaker detection analyzes speech audio to determine whether more than one speaker is present. Detecting this is crucial for a universal speech model, as the presence of multiple speakers can alter the context and understanding of the spoken content. |
| Noise Detection | LJSpeech dataset<br>VCTK Dataset<br>Musan Dataset | Audio | Binary Label | Binary Classification | Acoustic | Noise detection identifies whether speech audio is clean or mixed with noise. The expected answer is either yes or no. The noise may be of many types, such as music, speech, or Gaussian noise. The task is challenging because the model must not only process the content of the speech but also recognize its degradation. |
| Noise SNR Level Prediction | VCTK Dataset<br>Musan Dataset | Audio | Scalar | Regression | Acoustic | Noise SNR level prediction estimates the signal-to-noise ratio of speech audio; the expected answer could be zero, five, ten, or fifteen. The noise may be of many types, such as music, speech, or Gaussian noise. The model must understand the degree of noise degradation, not just the speech content (a noise-mixing sketch follows this table). |
| Non-verbal Voice Recognition | CNVVE | Audio | Label | Classification | Content | Non-verbal voice recognition recognizes non-verbal or non-lexical voice expressions, such as humming. |
| Offensive Language Identification | OLID | Audio | Label | Classification | Understanding | Offensive language identification identifies the type and target of offensive content in social media. |
| Overlapping Speech Detection | AMI meeting corpora<br>DIHARD I Challenge Data<br>DIHARD II Challenge Data<br>VoxConverse | Audio | Label, Timestamp | Classification | Content, Speaker | Overlapped speech detection (OSD) estimates the onsets and offsets of segments within an audio clip (an utterance, session, or conversation as a whole) where more than one speaker is speaking simultaneously. |
| Reverberation Detection | LJSpeech Dataset<br>VCTK Dataset<br>RIRs Noises Dataset | Audio | Binary Label | Binary Classification | Acoustic | Reverberation detection detects whether speech audio is clean or mixed with room impulse responses (RIRs) and noise. The expected answer is either clean or noisy. The reverberation can originate from a large, medium, or small room. The task is challenging because the model must understand how speech degrades under reverberation. |
| Sarcasm Detection | MUStARD Dataset | Audio | Binary Label | Binary Classification | Understanding | Sarcasm detection recognizes the presence of sarcasm or irony in speech. The expected answer is either true or false. The task is challenging because the model must understand higher-level semantic information. |
| Slot Filling | SLURP<br>ATIS<br>Snips | Audio | Text | Classification | Understanding | Slot filling identifies, from a running dialogue, the slots that correspond to different parameters of the user's query. For instance, when a user queries for nearby restaurants, slots for location and preferred food are required for a dialogue system to retrieve the appropriate information. The main challenge is extracting the target entity. |
| Speaker Counting | MUStARD Dataset | Audio | Label | Classification | Speaker | Speaker counting determines the total number of speakers in speech audio. The expected answer is one, two, three, four, or five. The task is challenging because the model must understand the patterns of different speakers. |
| Speaker Diarization | CHIME 5<br>CHIME 6<br>DIHARD II<br>LibriCSS<br>AISHELL-4<br>VoxConverse | Audio | Label, Timestamp | Classification | Speaker | Speaker diarization partitions an audio stream containing human speech into homogeneous segments according to the identity of each speaker. |
| Speaker Identification | LibriSpeech-TestClean Dataset<br>VCTK Dataset<br>VoxCeleb1<br>VoxCeleb2<br>CN-Celeb<br>AVSpeech<br>VoxTube | Audio | Label | Classification | Speaker | Speaker identification deals with identifying the speaker in an audio stream. |
| Speaker Verification | LibriSpeech-TestClean Dataset<br>VCTK Dataset<br>VoxCeleb1<br>VoxCeleb2<br>CN-Celeb | Audio, Audio | Binary Label | Binary Classification | Speaker | Speaker verification examines whether two given speech recordings come from the same speaker. The expected answer is either yes or no. The task is challenging because the model must understand the patterns of different speakers. |
| Speech Edit | LibriTTS<br>VCTK Dataset<br>LJSpeech Dataset | Audio, Text | Audio | Generation | Acoustic, Content | Speech editing allows the user to edit recorded speech, e.g., insert missed words, replace mispronounced words, and/or remove unwanted speech or non-speech events, without degrading the quality and naturalness of the edited speech. |
| Speech Command Recognition | Google Speech Commands V1 Dataset | Audio | Label | Classification | Content | Speech command recognition identifies the command presented in the speech. The expected answer is yes, no, up, down, left, right, on, off, stop, go, zero, one, two, three, four, five, six, seven, eight, nine, bed, bird, cat, dog, happy, house, marvin, sheila, tree, wow, or silence. |
| Speech Dereverberation | Reverb-WSJ0<br>WHAMR!<br>CHIME 5<br>CHIME 6 | Audio | Audio | Regression | Acoustic | Speech dereverberation removes the effects of reverberation from sound after the reverberant sound has been picked up by microphones. |
| Speech Detection | LJSpeech dataset<br>LibriSpeech-TestClean Dataset<br>LibriSpeech-TestOther Dataset | Audio | Binary Label | Binary Classification | Content | Speech detection, also known as voice activity detection or speech activity detection, identifies whether a given audio clip contains real speech. The expected answer is either yes or no. The model must understand not only the content of the audio but also the patterns of the human voice (an energy-based VAD sketch also follows this table). |
| Speech Enhancement | VoiceBank+DEMAND<br>DNS-Challenge<br>WHAM!<br>WHAMR! | Audio | Audio | Regression | Acoustic | Speech enhancement improves the intelligibility and/or overall perceptual quality of a degraded speech signal using audio signal processing techniques. |
| Speech Separation | WSJ0-2mix<br>LibriMix<br>Real-M<br>WHAM!<br>WHAMR!<br>CHIME 5<br>CHIME 6<br>AISHELL-4 | Audio | Audio, Audio | Regression | Speaker | Speech separation is the extraction of multiple speech signals from a mixture. |
| Speech Text Matching | LJSpeech dataset<br>LibriSpeech-TestClean Dataset<br>LibriSpeech-TestOther Dataset | Audio, Text | Binary Label | Binary Classification | Content | Speech-text matching determines whether the speech and the text share the same underlying message. The expected answer is either yes or no. |
| Speech-to-speech Translation | CVSS<br>CoVoST 2 | Audio | Audio | Generation | Language, Content | Speech-to-speech translation translates speech in one language into speech in another. It can be done with a text-centric cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems. |
| Speech-to-text Translation | MUST-C | Audio | Text | Generation | Language, Content | Speech-to-text translation, combining automatic speech recognition (ASR) and machine translation (MT), converts spoken language to text and translates it into another language. |
| Speech Quality Assessment | VCC2018<br>BVCC | Audio | Scalar | Regression | Acoustic | Speech quality assessment estimates the quality of speech, e.g., as a mean opinion score (MOS). |
| Spoken Question Answering | Spoken-SQuAD<br>ODSQA<br>NMSQA | Audio | Text | Generation | Understanding | Spoken question answering (SQA) finds the answer within a spoken document, given a question in either text or spoken form. SQA is crucial for personal assistants replying to users' spoken queries. |
| Spoken Term Detection | LJSpeech dataset<br>LibriSpeech-TestClean Dataset<br>LibriSpeech-TestOther Dataset | Audio, Text | Binary Label | Binary Classification | Content | Spoken term detection checks for the existence of a given word in the speech. The expected answer is either yes or no. |
| Stress Detection | MIR-SD Dataset | Audio | Binary Label | Binary Classification | Acoustic | Stress detection determines stress placement in English words. The expected answer is zero, one, two, three, four, or five. Understanding such paralinguistic features is crucial for a universal speech model, as they distinguish speech from mere text. |
| Target Speaker Extraction | WSJ0-2mix<br>LibriMix<br>Real-M<br>WHAM!<br>WHAMR!<br>CHIME 5<br>CHIME 6 | Audio, Audio | Audio | Regression | Speaker | Target speaker extraction segregates the speech of a target speaker from a mixture of interfering speakers with the help of auxiliary information. |
| Text-To-Speech Synthesis | LJ Speech<br>LibriTTS<br>AISHELL 3<br>LibriTTS-R<br>YTTTS | Text | Audio | Generation | Acoustic | Text-to-speech (TTS) synthesis converts normal language text into speech. |
| Vocal Sound Classification | VocalSound | Audio | Label | Classification | Acoustic | Vocal sound classification aims at automatic recognition of human vocal sounds such as laughter, sighs, coughs, throat clearing, sneezes, and sniffs. |
| Voice Conversion | LibriTTS<br>VCTK Dataset<br>ESD | Audio, Audio | Audio | Generation | Acoustic, Speaker | Voice conversion modifies the speech of a source speaker to make it sound like that of a target speaker without changing the linguistic information. |
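
For the noise-oriented rows above (Noise Detection, Noise SNR Level Prediction), training and evaluation pairs are typically built by mixing clean speech with noise at a chosen SNR. Below is a minimal sketch of that mixing step; it assumes mono float arrays at a shared sample rate and is illustrative rather than taken from any of the listed datasets.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix."""
    noise = noise[: len(speech)]  # assumes the noise clip is at least as long as the speech
    p_speech = np.mean(speech**2)
    p_noise = np.mean(noise**2) + 1e-12  # avoid division by zero for silent noise
    # Solve p_speech / (gain**2 * p_noise) = 10**(snr_db / 10) for gain.
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Toy usage: a 220 Hz tone "utterance" mixed with white noise at 10 dB SNR.
t = np.linspace(0.0, 1.0, 16000)
clean = 0.5 * np.sin(2 * np.pi * 220 * t)
noisy = mix_at_snr(clean, np.random.randn(16000), snr_db=10.0)
```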
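
Similarly, Speech Detection (voice activity detection) is often bootstrapped with a simple frame-energy heuristic before any learned model enters the picture. A minimal sketch follows; the frame size, hop, and threshold are illustrative assumptions, not values tied to the datasets above.

```python
import numpy as np

def energy_vad(audio: np.ndarray, frame_len: int = 400, hop: int = 160,
               threshold_db: float = -35.0) -> np.ndarray:
    """Return one boolean speech/non-speech decision per frame from log energy."""
    n_frames = 1 + max(0, len(audio) - frame_len) // hop
    decisions = np.zeros(n_frames, dtype=bool)
    for i in range(n_frames):
        frame = audio[i * hop : i * hop + frame_len]
        energy_db = 10 * np.log10(np.mean(frame**2) + 1e-12)
        decisions[i] = energy_db > threshold_db
    return decisions
```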

And more

## Tasks with Different Input & Output Modes

| Input | Output | Tasks |
| --- | --- | --- |
| Audio | Audio | Acoustic Echo Cancellation, Speech Dereverberation, Speech Enhancement, Speech-to-speech Translation |
| Audio | Audio, Audio | Speech Separation |
| Audio | Binary Label | DeepFake Detection (Spoof Detection), Enhancement Detection, MultiSpeaker Detection, Noise Detection, Reverberation Detection, Sarcasm Detection, Speech Detection, Stress Detection |
| Audio | Label | Accent Classification, Dialogue Act Classification, Dialogue Emotion Classification, Emotion Recognition, Intent Classification, Language Identification, Non-verbal Voice Recognition, Offensive Language Identification, Speaker Counting, Speaker Identification, Speech Command Recognition, Vocal Sound Classification |
| Audio | Label, Timestamp | Overlapping Speech Detection, Speaker Diarization |
| Audio | Scalar | Dysarthric Speech Assessments, HowFarAreYou, Noise SNR Level Prediction, Speech Quality Assessment |
| Audio | Text | Automatic Speech Recognition, Dysarthric Speech Recognition, Multilingual Speech Recognition, Slot Filling, Speech-to-text Translation, Spoken Question Answering |
| Audio, Audio | Audio | Laughter Synthesis, Target Speaker Extraction, Voice Conversion |
| Audio, Audio | Binary Label | Speaker Verification |
| Audio, Label | Binary Label | Dialogue Act Pairing |
| Audio, Text | Audio | Accented Text-to-speech, Speech Edit |
| Audio, Text | Binary Label | Keyword Spotting, Speech Text Matching, Spoken Term Detection |
| Text | Audio | Instruct TTS, Text-To-Speech Synthesis |
| Text, Label | Audio | Emotional TTS, Expressive TTS |
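
To make the `Audio, Audio -> Binary Label` pattern concrete: speaker verification typically reduces to thresholding the cosine similarity of two speaker embeddings. In the sketch below, `embed` is a stand-in for a pretrained speaker encoder (e.g., an x-vector or ECAPA-style model); the long-term-spectrum stub and the 0.7 threshold are assumptions made only so the example runs end to end.

```python
import numpy as np

def embed(audio: np.ndarray) -> np.ndarray:
    """Placeholder speaker encoder. In practice, replace this with a pretrained
    model; here a normalized long-term magnitude spectrum stands in."""
    spectrum = np.abs(np.fft.rfft(audio))
    return spectrum / (np.linalg.norm(spectrum) + 1e-12)

def same_speaker(a: np.ndarray, b: np.ndarray, threshold: float = 0.7) -> bool:
    """Binary verification decision: cosine similarity of embeddings vs. a threshold."""
    n = min(len(a), len(b))  # crude alignment: compare equal-length prefixes
    ea, eb = embed(a[:n]), embed(b[:n])
    return float(np.dot(ea, eb)) > threshold  # embeddings are unit-norm, so dot = cosine
```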

## Tasks with Different Levels

| Level | Tasks |
| --- | --- |
| Acoustic | Accent Classification, Accented Text-to-speech, Acoustic Echo Cancellation, DeepFake Detection (Spoof Detection), Dysarthric Speech Assessments, Emotional TTS, Enhancement Detection, Expressive TTS, HowFarAreYou, Instruct TTS, Laughter Synthesis, Noise Detection, Noise SNR Level Prediction, Reverberation Detection, Speech Dereverberation, Speech Edit, Speech Enhancement, Speech Quality Assessment, Stress Detection, Text-To-Speech Synthesis, Vocal Sound Classification, Voice Conversion |
| Content | Automatic Speech Recognition, Dysarthric Speech Recognition, Keyword Spotting, Multilingual Speech Recognition, Non-verbal Voice Recognition, Overlapping Speech Detection, Speech Command Recognition, Speech Detection, Speech Edit, Speech Text Matching, Speech-to-speech Translation, Speech-to-text Translation, Spoken Term Detection |
| Emotion | Dialogue Emotion Classification, Emotion Recognition, Emotional TTS |
| Language | Accent Classification, Accented Text-to-speech, Language Identification, Multilingual Speech Recognition, Speech-to-speech Translation, Speech-to-text Translation |
| Speaker | MultiSpeaker Detection, Overlapping Speech Detection, Speaker Counting, Speaker Diarization, Speaker Identification, Speaker Verification, Speech Separation, Target Speaker Extraction, Voice Conversion |
| Understanding | Dialogue Act Classification, Dialogue Act Pairing, Expressive TTS, Instruct TTS, Intent Classification, Offensive Language Identification, Sarcasm Detection, Slot Filling, Spoken Question Answering |
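
Several of the Acoustic-level regression tasks above (Speech Enhancement, Speech Dereverberation, Target Speaker Extraction, and Speech Separation under Speaker) are commonly trained and scored with scale-invariant SNR (SI-SNR). A minimal sketch of the metric, assuming time-aligned mono signals of equal length:

```python
import numpy as np

def si_snr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-noise ratio in dB (higher is better)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to isolate the "target" component.
    s_target = (np.dot(estimate, reference) / np.dot(reference, reference)) * reference
    e_noise = estimate - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + 1e-12))
```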

## References

  1. Dynamic-SUPERB

  2. paperswithcode

  3. kaggle

  4. AI-ADL

  5. INTERSPEECH 2023
