A Catalog of resources for Indian language NLP

Please suggest any other resources you may be aware of. Raise a pull request or an issue to add more resources to the catalog. Put the proposed entry in the following format:

[Wikipedia Dumps](https://dumps.wikimedia.org/)

Add a small, informative description of the dataset and provide links to any paper/article/site documenting the resource. Mention your name too. We would like to acknowledge your contribution to building this catalog in the CONTRIBUTORS list.
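As a rough illustration of the entry format, the sketch below builds and checks a proposed entry before it goes into a pull request. The helper names `format_entry` and `parse_entry` are hypothetical, and the `: description` suffix is an assumption based on how existing entries in this catalog are written.

```python
import re

# Assumed entry shape: "[Name](URL)" optionally followed by ": description".
# This pattern is an illustration, not an official validator for the catalog.
ENTRY_RE = re.compile(
    r"^\[(?P<name>[^\]]+)\]\((?P<url>https?://[^\s)]+)\)(?::\s*(?P<desc>.+))?$"
)


def format_entry(name, url, description=""):
    """Build a catalog entry line from its parts (hypothetical helper)."""
    entry = f"[{name}]({url})"
    if description:
        entry += f": {description.strip()}"
    return entry


def parse_entry(line):
    """Return the parts of a well-formed entry, or None if it is malformed."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    return {
        "name": match["name"],
        "url": match["url"],
        "description": match["desc"] or "",
    }


if __name__ == "__main__":
    line = format_entry("Wikipedia Dumps", "https://dumps.wikimedia.org/")
    print(line)
    print(parse_entry(line))
```

A line with a missing opening parenthesis before the URL makes `parse_entry` return `None`, which is a cheap way to catch broken links before submitting.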

🆕 Added Evaluation Benchmarks section

👍 Featured Resources

🆕AI4Bharat IndicNLG Suite: NLG benchmark for 11 languages spanning 5 generation tasks. Pre-trained models are also available.
🆕AI4Bharat IndicBART: Multilingual mBART based embeddings spanning 12 languages for Natural Language Generation (including Indian English).
🆕HiNER: Large manually annotated NER dataset for Hindi (100k sentences, 2m+ tokens) [paper]
AI4Bharat Cross-lingual Semantic Textual Similarity: 10 sentences across 11 en-Indic language pairs annotated on a scale of 0-5 as per SemEval cross-lingual STS guidelines
XL-Sum: Extreme Summarization data for many Indian languages
BUILD: Indian Legal Data Benchmark for rhetorical roles

Browse the entire catalog...

🙋Note: Many known resources have not yet been classified into the catalog. They can be found as open issues in the repo.

Major Indic Language NLP Repositories
Libraries and Tools
Evaluation Benchmarks
Standards
- Unicode Standard
Text Corpora
Models
Speech Corpora
OCR Corpora
Multimodal Corpora
Language Specific Catalogs

Major Indic Language NLP Repositories

Libraries and Tools

Indic NLP Library: Python Library for various Indian language NLP tasks like tokenization, sentence splitting, normalization, script conversion, transliteration, etc.
pyiwn: Python Interface to IndoWordNet
Indic-OCR: OCR for Indic Scripts
CLTK: Toolkit for many of the world's classical languages. Support for Sanskrit. Some parts of the Sanskrit library are forked from the Indic NLP Library.
iNLTK: iNLTK aims to provide out of the box support for various NLP tasks that an application developer might need for Indic languages.
Sanskrit Coders Indic Transliteration: Script conversion and romanization for Indian languages.
Smart Sanskrit Annotator: Annotation tool for Sanskrit (paper)
BNLP: Bengali language processing toolkit with tokenization, embedding, POS tagging, NER support
CodeSwitch: Language identification, POS tagging, NER and sentiment analysis support for code-mixed data, including Hindi and Nepali

Evaluation Benchmarks

Benchmarks spanning multiple tasks.

AI4Bharat IndicGLUE: NLU benchmark for 11 languages.
AI4Bharat IndicNLG Suite: NLG benchmark for 11 languages spanning 5 generation tasks.
GLUECoS: Hindi-English code-mixed benchmark containing the following tasks: Language Identification (LID), POS Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis (SA), Question Answering (QA), Natural Language Inference (NLI).
AI4Bharat Text Classification: A compilation of classification datasets for 10 languages.
WAT 2021 Translation Dataset: Standard train and test sets for translation between English and 10 Indian languages.

Standards

Unicode Standard for Indic Scripts
- An Introduction to Indic Scripts
- Unicode Standard for South Asian Scripts
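Because each major Indic script has its own Unicode block, the script of a piece of text can be guessed from character names alone. The following is a minimal sketch using only Python's standard `unicodedata` module; the function name `dominant_script` is hypothetical, and taking the first word of the Unicode character name as the script label is a simplifying assumption that works for the common Indic scripts but is not a full implementation of Unicode script properties.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the most frequent script in `text` from Unicode character names.

    Assumption: the first word of a character's Unicode name (for example
    "DEVANAGARI" in "DEVANAGARI LETTER KA") is treated as its script label.
    """
    counts = Counter()
    for ch in text:
        # Count letters and combining marks (vowel signs, virama, etc.);
        # skip digits, punctuation and whitespace.
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for sample in ["नमस्ते", "தமிழ்", "hello"]:
        print(sample, dominant_script(sample))
```

For real preprocessing work, the libraries listed under Libraries and Tools are the better choice; this sketch only illustrates why the Unicode layout makes script identification straightforward.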

Text Corpora

Monolingual Corpus

AI4Bharat IndicCorp: contains 8.9 billion tokens from 12 Indian languages (including Indian English).
Wikipedia Dumps
Common Crawl
- OSCAR Corpus: Released in 2019, large-scale processed CommonCrawl.
- WMT Common Crawl Dumps: Crawls between 2012 and 2016. Noisy text, needs to be filtered.
- CC-100 Corpus: Facebook CommonCrawl extracted data. They provide scripts for processing CommonCrawl. StatMT has built a replica of the CC-100 corpus using these scripts. You can find it HERE. This corpus also has romanized corpora for some Indian languages.
WMT NEWS Crawl
LDCIL Monolingual Corpus
Charles University Hindi Monolingual Corpus
Charles University Urdu Monolingual Corpus
IIT Bombay Hindi Monolingual Corpus
EMILLE Corpus (multiple Indian languages)
Janmabhumi Malayalam Corpus
Leipzig Corpus
Sanskrit Monolingual and Sandhi-split Corpus
Lot Of Indic Tweets Corpus: Large Twitter datasets for Telugu (7.9 million) and Hindi (17.6 million), along with fastText skip-gram and CBOW word vectors for both.
CMU Romanized Hinglish Corpus: See THIS PAPER for details.
JNU-BHLTR Bhojpuri Corpus: Bhojpuri corpus of 45k sentences.
KMI Magahi Corpus
KMI Awadhi Corpus
SMC Malayalam text corpus
DNLP-Tel Telugu Corpus: Telugu corpus of 280M tokens and 23M sentences.

Language Identification

VarDial 2018 Language Identification Dataset: 5 languages - Hindi, Braj, Awadhi, Bhojpuri, Magahi.

Lexical Resources

IndoWordNet
IIIT-Hyderabad Word Similarity Database: 7 Indian languages
Facebook Hindi Analogy Dataset
MGAD Hindi Analogy dataset
AI4Bharat Word Frequency Lists: Tokens and their frequencies from the AI4Bharat corpus, a large monolingual corpus.
Hindi RG-63: Hindi version of the Rubenstein and Goodenough (RG-65) word similarity dataset
IITB Cognate Datasets: Dataset of Cognates and False Friend Pairs for 12 Indian Languages. (Paper)

NER Corpora

FIRE 2013 AUKBC NER Corpus
FIRE 2014 AUKBC NER Corpus
IIT Bombay Marathi NER Corpus
WikiAnn NER Corpus (Noisy) DOWNLOAD (Old broken LINK)
IJCNLP 2008 NER Corpus: NER corpora for hi, bn, or, te, ur.
a-mma NER data

Parallel Translation Corpus

Samanantar Parallel Corpus: Largest parallel corpus for English and 11 Indian languages. It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages.
FLORES-101: Human translated evaluation sets for 101 languages released by Facebook. It includes 14 Indic languages. The test sets are n-way parallel.
IIT Bombay English-Hindi Parallel Corpus: Largest en-hi parallel corpus in the public domain (about 1.5 million segments)
CVIT-IIITH PIB Multilingual Corpus: Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language).
CVIT-IIITH Mann ki Baat Corpus: Mined from Indian PM Narendra Modi's Mann ki Baat speeches.
PMIndia: Parallel corpus for En-Indian languages mined from Mann ki Baat speeches of the PM of India (paper).
OPUS corpus
WAT 2018 Parallel Corpus: There may be significant overlap between WAT and OPUS.
Charles University English-Hindi Parallel Corpus: This is included in the IITB parallel corpus.
Charles University English-Tamil Parallel Corpus
Charles University English-Odia Parallel Corpus v1.0
Charles University English-Odia Parallel Corpus v2.0
Charles University English-Urdu Religious Parallel Corpus
Indian Language Corpora Initiative: Available on TDIL portal on request
IndoWordnet Parallel Corpus: Parallel corpora mined from IndoWordNet gloss and/or examples for Indian-Indian language corpora (6.3 million segments, 18 languages).
MTurk Indian Parallel Corpus
TED Parallel Corpus
JW300 Corpus: Parallel corpus mined from jw.org. Religious text from Jehovah's Witnesses.
ALT Parallel Corpus: 10k sentences for Bengali, Hindi in parallel with English and many East Asian languages.
FLORES dataset: English-Sinhala and English-Nepali corpora
Uka Tarsadia University Corpus: 65k English-Gujarati sentence pairs. Corpus is described in this paper
NLPC-UoM English-Tamil Corpus: 9k sentences, 24k glossary terms
Wikititles: from statmt
- English-Tamil Wiki Titles
- English-Gujarati Wiki Titles
JNU-BHLTR Bhojpuri Corpus: English-Bhojpuri corpus of 65k sentences
EILMT Corpus
QED Corpus: English-Hindi corpus of 43k sentences from the educational domain.
WikiMatrix Corpus: Mined from Wikipedia, looks noisy.
CCMatrix: Parallel corpus mined from CommonCrawl, looks noisy (statmt repo).
CGNetSwara: Hindi-Gondi parallel corpus (19k sentence pairs)
MTEnglish2Odia: English-Odia (42k pairs)
SAP Software Documentation: test and evaluation set for English-Hindi in the software documentation domain [paper]
BUET English-Bangla Corpus, EMNLP-2020: 2.7M sentences (has overlaps with OPUS)
CLE Parallel Corpus: Parallel corpus for English, Urdu and Nepali.
Itihasa Parallel Corpus: 93k parallel sentences between English and Sanskrit from the Ramayana and Mahabharata.

Parallel Transliteration Corpus

Dakshina Dataset: The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. Contains an aggregate of around 300k word pairs and 120k sentence pairs.
BrahmiNet Corpus: 110 language pairs mined from ILCI parallel corpus.
Xlit-Crowd: Hindi-English Transliteration Corpus created via crowdsourcing.
Xlit-IITB-Par: Hindi-English Transliteration Corpus mined from parallel translation corpora.
FIRE 2013 Track on Transliterated Search: Transliteration dataset of native words in Hindi, Bengali and Gujarati.
NEWS 2018 Shared Task dataset: Transliteration datasets for Kannada, Tamil, Bengali and Hindi created by Microsoft Research India.
AI4Bharat StoryWeaver Xlit Dataset: Transliteration datasets for Hindi, Maithili & Konkani
Hindi WikiData Transliteration Pairs: Hindi dataset (90k pairs)
NotAI-tech English-Telugu: Around 38k word pairs

Text Classification

BBC news articles classification dataset: 14 class classification
iNLTK News Headlines classification: Datasets for multiple Indian languages.
AI4Bharat IndicNLP News Articles: Word embeddings for 10 Indian languages.

Textual Entailment/Natural Language Inference

XNLI corpus: Hindi and Urdu test sets and machine translated training sets (from English MultiNLI).

Paraphrase

Amrita University-DPIL Corpus: Sentence level paraphrase identification for four Indian languages (Tamil, Malayalam, Hindi and Punjabi).

Sentiment, Sarcasm, Emotion Analysis

IIT Bombay movie review datasets for Hindi and Marathi
IIT Patna movie review datasets for Hindi
IIIT-H LTRC Multi-domain dataset for Telugu
ACTSA corpus for Telugu
BHAAV (भाव) Corpus: A Text Corpus for Emotion Analysis from Hindi Stories
SentiWordNet - SAIL - Hindi, Bangla, Tamil & Telugu
Dravidian-CodeMix - FIRE 2020 - Tamil & Malayalam
Bengali Sentiment Analysis - Classification Benchmark, 2020: 8k sentences
SentNoB: sentiment dataset for Bangla from 3 domains on user comments containing 15k examples (Paper) (Dataset)

Hate Speech and Offensive Comments

Hate Speech and Offensive Content Identification in Indo-European Languages: (HASOC FIRE-2020)
An Indian Language Social Media Collection for Hate and Offensive Speech, 2020: Hinglish Tweets and FB Comments collected during Parliamentary Election 2019 of India (Dataset available on request)
Aggression-annotated Corpus of Hindi-English Code-mixed Data, 2018: Scraped from Facebook (21k) & Twitter (18k) (Paper)
Did You Offend Me? Classification of Offensive Tweets in Hinglish Language, 2018: 3k tweets (Paper)
A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection, 2018: 4.5k Tweets (Paper)
Roman Urdu Offensive Language Detection, 2020: 10k tweets, can also be used for Hindi (Paper)
Bengali Hate Speech - Classification Benchmark, 2020: 1.5k sentences
Offensive Language Identification in Dravidian Languages, EACL 2021: Tamil, Malayalam, Kannada
Fear Speech in Indian WhatsApp Groups, 2021

Question Answering

Facebook Multilingual QA datasets: Contains dev and test sets for Hindi.
TyDi QA datasets: QA dataset for Bengali and Telugu.
bAbi 1.2 dataset: Has Hindi version of bAbi tasks in romanized Hindi.
MMQA dataset: Hindi QA dataset described in this paper
XQuAD: test set for Hindi QA from human translation of a subset of SQuAD v1.1. Described in this paper
XQA: test set for Tamil QA. Described in this paper
HindiRC: A Dataset for Reading Comprehension in Hindi containing 127 questions and 24 passages. Described in this paper
IITH HiDG: A Distractor Generation Dataset for Hindi consisting of 1k/1k/5k (train/validation/test) split. Described in this paper
Chaii: A Kaggle challenge consisting of 1104 questions in Hindi and Tamil. Also, here is a good collection of papers on multilingual Question Answering.

Dialog

a-mma Indic Casual Dialogs Datasets

Discourse

MIDAS-Hindi Discourse Analysis

Information Extraction

EventXtract-IL: Event extraction for Tamil and Hindi. Described in this paper.
[EDNIL-FIRE2020](https://ednilfire.github.io/ednil/2020/index.html): Event extraction for Tamil, Hindi, Bengali, Marathi, English. Described in this paper.

Models

Word Embeddings

AI4Bharat IndicFT: FastText word embeddings for 11 Indian languages.
FastText CommonCrawl+Wikipedia
FastText Wikipedia
Polyglot

Pre-trained Language Models

AI4Bharat IndicBERT: Multilingual ALBERT based embeddings spanning 12 languages for Natural Language Understanding (including Indian English).
AI4Bharat IndicBART: Multilingual mBART based embeddings spanning 12 languages for Natural Language Generation (including Indian English).
MuRIL: Multilingual mBERT based embeddings spanning 17 languages and their transliterated counterparts for Natural Language Understanding (paper).
BERT Multilingual: BERT model trained on Wikipedias of many languages (including major Indic languages).
iNLTK: ULMFit and TransformerXL pre-trained embeddings for many languages trained on Wikipedia and some News articles.
albert-base-sanskrit: ALBERT-based model trained on Sanskrit Wikipedia.
RoBERTa-hindi-guj-san: Multilingual RoBERTa-like model trained on Hindi, Sanskrit and Gujarati.
Bangla-BERT-Base: Bengali BERT model trained on Bengali Wikipedia and OSCAR datasets.

Multilingual Word Embeddings

Morphanalyzers

AI4Bharat IndicNLP Project: Unsupervised morphanalyzers for 10 Indian languages learnt using morfessor.

Translation Models

IndicTrans: Multilingual neural translation models for translation between English and 11 Indian languages. Supports translation between Indian languages as well. A total of 110 translation directions are supported.
Shata-Anuvaadak: 110 language pairs
LTRC Vanee: Dependency-based Statistical MT system from English to Hindi

Speech Models

AI4Bharat IndicWav2Vec: Multilingual pre-trained models for 40 Indian languages based on Wav2Vec 2.0.
Vakyansh CLSRIL-23: Pretrained wav2vec2 model trained on 10,000 hours of Speech data in 23 Indic Languages (documentation) (experimentation platform).
arijitx/wav2vec2-large-xlsr-bengali: Pretrained wav2vec2-large-xlsr trained on ~50 hrs (40,000 utterances) of OpenSLR Bengali data. Test WER 32.45% without LM.

Speech Corpora

Microsoft Speech Corpus: Speech corpus for Telugu, Tamil and Gujarati.
Microsoft-IITB Marathi Speech Corpus: 109 hours of speech data collected via crowdsourcing.
AccentDB: Database of Indian English accents from native speakers of Bangla, Malayalam, Telugu and Oriya.
IIT Madras TTS database
BABEL Speech Corpus: includes some Indian languages
Pratham ASER dataset: Dataset for research on reading level assessment.
WikiPron: Words and their pronunciations in IPA mined from Wiktionary. Includes Indian languages. paper
CVIT IndicSpeech: TTS data for 3 Indian languages: Malayalam, Bengali and Hindi (24 hours each).
Google Speech Corpus: TTS data for 6 Indian languages: Malayalam, Marathi, Telugu, Kannada, Gujarati, Tamil (up to 9 hours each). Resources SLR#63-#66, #78-#79. (paper)
CoVoST 2: Tamil 2 hrs data
SMC Malayalam Speech Corpus - Download link
Vāksañcayaḥ Sanskrit Speech Corpus: 78 hours of speech in Sanskrit prose, with speaker-disjoint train, dev and test splits. It also contains additional out-of-domain test data from speakers whose pronunciation is influenced by their L1 (paper).

OCR Corpora

Multimodal Corpora

English-Hindi Visual Genome: Images captioned in both English and Hindi.
English-Hindi Flickr 8k: A subset of images from Flickr8k images captioned by native speakers in both English and Hindi. Code and data available here.

Language Specific Catalogs

Pointers to language-specific NLP resource catalogs

Priyansh2 / indicnlp_catalog