Indian Language ASR

Speech Lab, IIT Madras announces an Automatic Speech Recognition (ASR) Challenge in three Indian languages: Hindi, Tamil and Indian-English. This is the third in the planned series of ASR challenges. In this installment, approximately 490 hours of transcribed speech data in the three languages will be made open source. This data subsumes the data released in the previous challenges. The details of the first and the second challenges can be found here and here.

These challenges are part of the National Language Translation Mission funded by MeitY. They aim to help and encourage the advancement of ASR in Indian languages. We plan to hold a series of challenges of increasing difficulty in different Indian languages and to release appropriate data with each challenge. In the first two challenges, we released everything, including source code, so that start-ups, universities and research labs without previous experience in ASR could also participate and become familiar with it.

CHALLENGE OVERVIEW

Recent advancements in speech technology have shown that ASR systems can work on par with humans. Building a good ASR system requires large amounts of training data and high-end computational resources.

However, when it comes to Indian languages, not everyone, especially academic institutions and startups, has access to these resources. As part of this challenge, we will be releasing speech data in Hindi, Tamil and Indian-English. Everyone who participates in this challenge will then be free to use this data for research purposes.

DATA SET DETAILS

The data set comprises Hindi, Tamil and Indian-English read and conversational speech data along with the corresponding transcriptions. This speech data was collected by Speech Lab IITM and several startups. We will be releasing approximately 490 hours of speech data in this challenge round. The details of the data sets released for this challenge are as follows:

Language	Train set	Development set	Evaluation set	Total duration
HINDI	178.4 hours	4.8 hours	4.9 hours	188.1 hours
TAMIL	104.5 hours	3.9 hours	3.8 hours	112.2 hours
INDIAN ENGLISH	179.5 hours	5.4 hours	5.4 hours	190.3 hours

A lexicon has also been made available. It was generated using the Unified-parser (Hindi and Tamil) and the CMU Lexicon tool (Indian-English). The Hindi and English data released in this challenge include the Hindi data released in the first challenge and the "IITM" English data released in the second challenge, respectively. In total, approximately 490 hours + 200 hours (NPTEL data from the second challenge) = 690 hours of transcribed speech data have been released through these three challenges.
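The duration arithmetic above can be checked with a short script. This is only a sketch: the per-set figures are copied from the table, the 200-hour NPTEL figure is the approximate value quoted above, and the variable names are illustrative rather than part of any released tooling.

```python
# Durations in hours, copied from the data set table above.
durations = {
    "HINDI": {"train": 178.4, "dev": 4.8, "eval": 4.9},
    "TAMIL": {"train": 104.5, "dev": 3.9, "eval": 3.8},
    "INDIAN ENGLISH": {"train": 179.5, "dev": 5.4, "eval": 5.4},
}

# Approximate size of the NPTEL data from the second challenge, as quoted above.
NPTEL_HOURS = 200.0

# Total per language: train + development + evaluation, rounded to 0.1 hours.
per_language = {
    lang: round(sum(parts.values()), 1) for lang, parts in durations.items()
}

# Total for this challenge round, and across all three challenges.
challenge_total = round(sum(per_language.values()), 1)
overall_total = round(challenge_total + NPTEL_HOURS, 1)

for lang, hours in per_language.items():
    print(f"{lang}: {hours} hours")
print(f"This challenge: {challenge_total} hours")
print(f"All three challenges: {overall_total} hours")
```

The per-language sums match the "Total duration" column, and the two overall figures are consistent with the rounded values of approximately 490 and 690 hours given in the text.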

IMPORTANT DATES

Release of training data, development data and lexicon: May 13, 2021
Evaluation data release and opening of submission site: ~~July 7th, 2021~~ July 14th, 2021
Closing of submission site: ~~July 14th, 2021~~ July 21st, 2021 (midnight anywhere in the world, i.e., 12pm UTC on July 21st, 2021)
Announcement of results: July 22nd, 2021
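
The closing time of the submission site can be converted between time zones with a few lines of Python. This sketch rests on two assumptions that the text does not spell out: that "anywhere in the world" refers to the Anywhere on Earth convention (UTC-12), and that "midnight" means 00:00 at the start of July 21st, 2021 in that zone. Under those assumptions the result agrees with the 12pm UTC figure given above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "anywhere in the world" means Anywhere on Earth (UTC-12).
AOE = timezone(timedelta(hours=-12), name="AoE")

# Assumption: "midnight" is 00:00 at the start of July 21st, 2021 in AoE.
closing_aoe = datetime(2021, 7, 21, 0, 0, tzinfo=AOE)

# Convert the same instant to UTC.
closing_utc = closing_aoe.astimezone(timezone.utc)

print(closing_utc.isoformat())  # 2021-07-21T12:00:00+00:00
```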

Models

The models for Indian English, Hindi and Tamil have been uploaded to Google Drive and can be downloaded using the link below.

Google Drive link
