A Catalog of resources for Indian language NLP
Please suggest any other resources you may be aware of. Raise a pull request or an issue to add more resources to the catalog. Put the proposed entry in the following format:
[Wikipedia Dumps](https://dumps.wikimedia.org/)
Add a small, informative description of the dataset and provide links to any paper/article/site documenting the resource. Mention your name too. We would like to acknowledge your contribution to building this catalog in the CONTRIBUTORS list.
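For contributors who want to generate entries programmatically, the format above can be sketched as a small helper. The function name `format_entry` and the optional trailing description (following the `Name: description` style used throughout the catalog) are assumptions made for illustration, not part of the repository.

```python
def format_entry(name: str, url: str, description: str = "") -> str:
    """Build a catalog entry in the form [Name](URL), optionally followed by ': description'."""
    name, url, description = name.strip(), url.strip(), description.strip()
    if not name or not url:
        raise ValueError("both a name and a URL are required")
    entry = f"[{name}]({url})"
    if description:
        entry += f": {description}"
    return entry


print(format_entry("Wikipedia Dumps", "https://dumps.wikimedia.org/"))
```

The description is kept optional because many existing entries consist of a link only.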
🆕 Added Evaluation Benchmarks sections
👍 Featured Resources
- IIT Bombay English-Hindi Parallel Corpus: Largest en-hi parallel corpus in the public domain (about 1.5 million segments)
- CVIT-IIITH PIB Multilingual Corpus: Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language).
- CVIT-IIITH Mann ki Baat Corpus: Mined from Indian PM Narendra Modi's Mann ki Baat speeches.
- AI4Bharat IndicNLP Project: Text corpora, word embeddings, text classification datasets for Indian languages.
- iNLTK: iNLTK aims to provide out of the box support for various NLP tasks that an application developer might need for Indic languages.
- Dakshina Dataset: The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. Contains an aggregate of around 300k word pairs and 120k sentence pairs. Useful for transliteration.
Browse the entire catalog...
🙋Note: Many known resources have not yet been classified into the catalog. They can be found as open issues in the repo.
- Major Indic Language NLP Repositories
- Libraries
- Evaluation Benchmarks
- Text Corpora
- Unicode Standard
- Monolingual Corpus
- Language Identification
- Lexical Resources
- NER Corpora
- Parallel Translation Corpus
- Parallel Transliteration Corpus
- Textual Entailment
- Paraphrase
- Sentiment, Sarcasm, Emotion Analysis
- Question Answering
- Dialog
- Discourse
- POS Tagged corpus
- Chunk Corpus
- Dependency Parse Corpus
- Models
- Speech Corpora
- OCR Corpora
- Multimodal Corpora
- Language Specific Catalogs
Major Indic Language NLP Repositories
- Technology Development for Indian Languages (TDIL)
- Center for Indian Language Technology (CFILT)
- Language Technologies Research Center (LTRC)
- Linguistic Data Consortium For Indian Languages (LDCIL)
- University of Hyderabad - Sanskrit NLP
Libraries
- Indic NLP Library: Python library for various Indian language NLP tasks like tokenization, sentence splitting, normalization, script conversion, transliteration, etc.
- pyiwn: Python Interface to IndoWordNet
- Indic-OCR : OCR for Indic Scripts
- CLTK: Toolkit for many of the world's classical languages. Support for Sanskrit. Some parts of the Sanskrit library are forked from the Indic NLP Library.
- iNLTK: iNLTK aims to provide out of the box support for various NLP tasks that an application developer might need for Indic languages.
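To give a feel for what "script conversion" in the list above involves, here is a minimal standard-library sketch. It relies on the fact that the major Brahmi-derived scripts occupy parallel 128-code-point Unicode blocks, so many characters can be mapped by offset arithmetic. This is a naive illustration and not the implementation of any library listed here; the names `SCRIPT_BASE` and `convert_script` are hypothetical, and some offsets map to code points that are unassigned in the target script.

```python
# Start of each script's Unicode block (from the Unicode code charts).
SCRIPT_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80
# Danda and double danda live in the Devanagari block but are shared across scripts.
SHARED_OFFSETS = {0x64, 0x65}


def convert_script(text: str, src: str, tgt: str) -> str:
    """Map characters from the src script block to the same offset in the tgt block."""
    src_base, tgt_base = SCRIPT_BASE[src], SCRIPT_BASE[tgt]
    out = []
    for ch in text:
        offset = ord(ch) - src_base
        if 0 <= offset < BLOCK_SIZE and offset not in SHARED_OFFSETS:
            out.append(chr(tgt_base + offset))
        else:
            out.append(ch)  # leave spaces, Latin text, digits, dandas untouched
    return "".join(out)


print(convert_script("कमल", "devanagari", "telugu"))
```

A production-quality converter needs per-script exceptions (for example, Tamil lacks many consonants present in Devanagari), which is why a dedicated library is preferable for real use.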
Evaluation Benchmarks
Benchmarks spanning multiple tasks.
- GLUECoS: Benchmark for Hindi-English code-mixed data covering the following tasks: Language Identification (LID), POS Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis (SA), Question Answering (QA), Natural Language Inference (NLI).
- AI4Bharat Text Classification: A compilation of classification datasets for 10 languages.
Text Corpora
Unicode Standard
Monolingual Corpus
- Wikipedia Dumps
- Common Crawl
- OSCAR Corpus: Released in 2019; a large-scale, processed version of Common Crawl.
- WMT Common Crawl Dumps: Crawls between 2012 and 2016. Noisy text, needs to be filtered.
- WMT NEWS Crawl
- LDCIL Monolingual Corpus
- Charles University Hindi Monolingual Corpus
- Charles University Urdu Monolingual Corpus
- IIT Bombay Hindi Monolingual Corpus
- EMILLE Corpus (multiple Indian languages)
- Janmabhumi Malayalam Corpus
- Leipzig Corpus
- Sanskrit Monolingual and Sandhi-split Corpus
- Lot Of Indic Tweets Corpus: Large Twitter datasets for Telugu (7.9 million) and Hindi (17.6 million), along with fastText skip-gram and CBOW word vectors trained on them.
- CMU Romanized Hinglish Corpus: See this paper for details.
- JNU-BHLTR Bhojpuri Corpus: Bhojpuri corpus of 45k sentences.
- KMI Magahi Corpus
- KMI Awadhi Corpus
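Several of the crawled corpora above are noted as noisy and in need of filtering. One common first pass is to keep only lines written mostly in the target script. The sketch below does this for Devanagari using only the standard library; the function names, the 0.7 threshold, and the minimum length are illustrative assumptions, not recommendations from the corpus authors.

```python
DEVANAGARI = (0x0900, 0x097F)  # Devanagari Unicode block


def keep_line(line, block=DEVANAGARI, threshold=0.7, min_chars=5):
    """Return True if enough of the line's non-space characters fall in the script block."""
    chars = [ch for ch in line if not ch.isspace()]
    if len(chars) < min_chars:
        return False
    lo, hi = block
    in_script = sum(lo <= ord(ch) <= hi for ch in chars)
    return in_script / len(chars) >= threshold


def filter_lines(lines, **kwargs):
    """Yield only the lines that pass keep_line."""
    for line in lines:
        if keep_line(line, **kwargs):
            yield line


sample = [
    "यह एक साफ़ हिंदी वाक्य है।",
    "Click here to subscribe to our newsletter",
    "",
]
print(list(filter_lines(sample)))
```

Script filtering does not separate languages that share a script (for example Hindi, Marathi, and Bhojpuri), so a language identification step is usually still needed afterwards.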
Language Identification
- VarDial 2018 Language Identification Dataset: 5 languages - Hindi, Braj, Awadhi, Bhojpuri, Magahi.
Lexical Resources
- IndoWordNet
- IIIT-Hyderabad Word Similarity Database: 7 Indian languages
- Facebook Hindi Analogy Dataset
- MGAD Hindi Analogy dataset
NER Corpora
- FIRE 2013 AUKBC NER Corpus
- FIRE 2014 AUKBC NER Corpus
- IIT Bombay Marathi NER Corpus
- WikiAnn NER Corpus (Noisy)
- a-mma NER data
Parallel Translation Corpus
- IIT Bombay English-Hindi Parallel Corpus: Largest en-hi parallel corpus in the public domain (about 1.5 million segments)
- CVIT-IIITH PIB Multilingual Corpus: Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language).
- CVIT-IIITH Mann ki Baat Corpus: Mined from Indian PM Narendra Modi's Mann ki Baat speeches.
- PMIndia: Parallel corpus for En-Indian languages mined from the website of the PM of India (paper).
- Indian Language Corpora Initiative: Available on TDIL portal on request
- OPUS corpus
- WAT 2018 Parallel Corpus: There may be significant overlap between WAT and OPUS.
- Charles University English-Hindi Parallel Corpus: This is included in the IITB parallel corpus.
- Charles University English-Tamil Parallel Corpus
- Charles University English-Odia Parallel Corpus v1.0
- Charles University English-Odia Parallel Corpus v2.0
- Charles University English-Urdu Religious Parallel Corpus
- IndoWordnet Parallel Corpus: Indian-Indian language parallel corpora mined from IndoWordNet glosses and/or examples (6.3 million segments, 18 languages).
- MTurk Indian Parallel Corpus
- TED Parallel Corpus
- JW300 Corpus: Parallel corpus mined from jw.org. Religious text from Jehovah's Witnesses.
- ALT Parallel Corpus: 10k sentences for Bengali, Hindi in parallel with English and many East Asian languages.
- FLORES dataset: English-Sinhala and English-Nepali corpora
- Uka Tarsadia University Corpus: 65k English-Gujarati sentence pairs. The corpus is described in this paper.
- NLPC-UoM English-Tamil Corpus: 9k sentences, 24k glossary terms
- English-Tamil Wiki Titles: from statmt
- JNU-BHLTR Bhojpuri Corpus: English-Bhojpuri corpus of 65k sentences
- EILMT Corpus
- WikiMatrix Corpus: Mined from Wikipedia, looks noisy.
- CCMatrix: Parallel corpus mined from CommonCrawl, looks noisy.
Parallel Transliteration Corpus
- Dakshina Dataset: The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. Contains an aggregate of around 300k word pairs and 120k sentence pairs.
- BrahmiNet Corpus: 110 language pairs mined from ILCI parallel corpus.
- Xlit-Crowd: Hindi-English Transliteration Corpus created via crowdsourcing.
- Xlit-IITB-Par: Hindi-English Transliteration Corpus mined from parallel translation corpora.
- FIRE 2013 Track on Transliterated Search: Transliteration dataset of native words in Hindi, Bengali and Gujarati.
- NEWS 2016 Shared Task dataset: Transliteration datasets for Kannada, Tamil, Bengali and Hindi created by Microsoft Research India.
- NotAI-tech English-Telugu: Around 38k word pairs
Text Classification
- BBC news articles classification dataset: 14 class classification
- iNLTK News Headlines classification: Datasets for multiple Indian languages.
- AI4Bharat IndicNLP News Articles: News article classification datasets for Indian languages.
Textual Entailment
- XNLI corpus: Hindi and Urdu test sets and machine translated training sets (from English MultiNLI).
Paraphrase
- Amrita University-DPIL Corpus: Sentence level paraphrase identification for four Indian languages (Tamil, Malayalam, Hindi and Punjabi).
Sentiment, Sarcasm, Emotion Analysis
- IIT Bombay movie review datasets for Hindi and Marathi
- IIT Patna movie review datasets for Hindi
- IIIT-H LTRC Multi-domain dataset for Telugu
- ACTSA corpus for Telugu
- BHAAV (भाव) Corpus: A Text Corpus for Emotion Analysis from Hindi Stories
Question Answering
- Facebook Multilingual QA datasets: Contains dev and test sets for Hindi.
- TyDi QA datasets: QA dataset for Bengali and Telugu.
- bAbI 1.2 dataset: Includes a version of the bAbI tasks in romanized Hindi.
- MMQA dataset: Hindi QA dataset described in this paper.
- XQuAD: Test set for Hindi QA, human-translated from a subset of SQuAD v1.1. Described in this paper.
Dialog
Discourse
Information Extraction
- EventXtract-IL: Event extraction for Tamil and Hindi. Described in this paper.
POS Tagged corpus
- Indian Language Corpora Initiative
- Universal Dependencies
- Code Mixed Dataset for Hindi, Bengali and Telugu, ICON 2016 shared task
- JNU-BHLTR Bhojpuri Corpus: Bhojpuri corpus of 5000 sentences.
- KMI Magahi Corpus
- KMI Awadhi Corpus
Chunk Corpus
Dependency Parse Corpus
- IIIT Hyderabad Hindi Treebank
- Universal Dependencies
- Universal Dependencies Hindi Treebank
- Universal Dependencies Urdu Treebank
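The Universal Dependencies treebanks listed above are distributed in the CoNLL-U format: one token per line with ten tab-separated fields, `#` comment lines, and a blank line between sentences. A minimal standard-library reader might look like the sketch below. The names `FIELDS` and `read_conllu` are hypothetical, and the two-token sample is made up for illustration rather than taken from any treebank.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(lines):
    """Yield each sentence as a list of dicts, one dict per token line."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():          # blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):      # sentence-level comment or metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # Multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1")
        # are kept as ordinary rows; callers can filter on the id field.
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:                      # file may not end with a blank line
        yield sentence


sample = [
    "# text = राम सोया",
    "1\tराम\tराम\tPROPN\t_\t_\t2\tnsubj\t_\t_",
    "2\tसोया\tसो\tVERB\t_\t_\t0\troot\t_\t_",
    "",
]
for sent in read_conllu(sample):
    print([(tok["form"], tok["upos"], tok["deprel"]) for tok in sent])
```

The same reader works for the POS-tagged data in these treebanks, since the part-of-speech tags are stored in the `upos` and `xpos` columns.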
Models
Word Embeddings
- FastText CommonCrawl+Wikipedia
- FastText Wikipedia
- Polyglot
- AI4Bharat IndicNLP Project: Word embeddings for 10 Indian languages.
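Pre-trained word embeddings such as the fastText releases above are commonly distributed as plain-text `.vec` files: a header line with the vocabulary size and dimension, followed by one word and its vector per line. The sketch below reads that layout with the standard library only. The function name `load_vec` and the `limit` parameter are assumptions for illustration, and the two-word sample is made up; other embeddings in this list may use different formats.

```python
import io


def load_vec(lines, limit=None):
    """Parse text-format embeddings into (dimension, {word: [floats]})."""
    it = iter(lines)
    _count, dim = map(int, next(it).split())   # header: "<vocab size> <dimension>"
    vectors = {}
    for line in it:
        if limit is not None and len(vectors) >= limit:
            break
        # Split from the right so a token containing spaces stays intact.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue                            # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


sample = io.StringIO("2 3\nनमस्ते 0.1 0.2 0.3\nभारत 0.4 0.5 0.6\n")
dim, vectors = load_vec(sample)
print(dim, sorted(vectors))
```

For real files, pass an open file handle (for example `open(path, encoding="utf-8")`) and a `limit`, since full vocabularies can hold millions of entries.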
Sentence Embeddings
- BERT Multilingual: BERT model trained on Wikipedias of many languages (including major Indic languages).
- iNLTK: ULMFiT and Transformer-XL pre-trained embeddings for many languages, trained on Wikipedia and some news articles.
- albert-base-sanskrit: ALBERT-based model trained on Sanskrit Wikipedia.
- RoBERTa-hindi-guj-san: Multilingual RoBERTa-like model trained on Hindi, Sanskrit and Gujarati.
Multilingual Word Embeddings
Morphanalyzers
- AI4Bharat IndicNLP Project: Unsupervised morphanalyzers for 10 Indian languages, learnt using Morfessor.
SMT Models
- Shata-Anuvaadak: 110 language pairs
- LTRC Vanee: Dependency based Statistical MT system from English to Hindi
Speech Corpora
- Microsoft Speech Corpus: Speech corpus for Telugu, Tamil and Gujarati.
- Microsoft-IITB Marathi Speech Corpus: 109 hours of speech data collected via crowdsourcing.
- AccentDB: Database of Indian English accents from native speakers of Bangla, Malayalam, Telugu and Oriya.
- IIT Madras TTS database
- BABEL Speech Corpus: includes some Indian languages
- Pratham ASER dataset: Dataset for research on reading level assessment.
OCR Corpora
Multimodal Corpora
- English-Hindi Visual Genome: Images captioned in both English and Hindi.
- English-Hindi Flickr 8k: A subset of Flickr8k images captioned by native speakers in both English and Hindi. Code and data available here.
Language Specific Catalogs
Pointers to language-specific NLP resource catalogs