Kartikaggarwal98 / Indian_ParallelCorpus

Curated list of publicly available parallel corpora for Indian languages

Parallel Corpus for Indian Languages

Available parallel data for training machine translation models in Indic languages: Assamese, Bengali, Gujarati, Gondi, Hindi, Kannada, Manipuri, Marathi, Malayalam, Oriya, Punjabi, Sanskrit, Tamil, Telugu.

Assamese-X

  1. Samanantar Corpus
  2. As-En PMIndia Corpus
  3. As-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row asm-eng.

Bengali-X

  1. Samanantar Corpus
  2. Bn-En BUET Parallel corpus: 2.75 million pairs of Bengali-English sentences @EMNLP 2020
  3. Bn-En Project Anuvaad
  4. Bn-En Indian Parallel Corpora
  5. CVIT-IIITH PIB Multilingual Corpus: en, gu, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  6. CVIT-IIITH Mann ki Baat Corpus: en, gu, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  7. Bn-En Indian-Language Dataset
  8. Bn-En Asian Language Treebank (ALT) Parallel Corpus
  9. Bn-En PMIndia Corpus
  10. Bn-En OPUS: Set source as en and target as bn
  11. Bn-En SUPARA 0.8M: Requires an IEEE DataPort Subscription
  12. Bn-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row ben-eng.

Gujarati-X

  1. Samanantar Corpus
  2. Gu-En WikiTitles Parallel Corpus: wikititles-v1.gu-en.tsv.gz
  3. Gu-En Project Anuvaad
  4. Gu-En Tsardia
  5. CVIT-IIITH PIB Multilingual Corpus: en, bn, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  6. CVIT-IIITH Mann ki Baat Corpus: en, bn, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  7. Gu-En Shahparth123
  8. Gu-En PMIndia Corpus
  9. Gu-En Bible Corpus
  10. Gu-En OPUS: Set source as en and target as gu
  11. Gu-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row guj-eng.
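
Several corpora in this list ship as compressed tab-separated files, such as the wikititles-v1.gu-en.tsv.gz file named above. The following is a minimal sketch for reading such a file, assuming one sentence pair per line in exactly two tab-separated columns. The column order and the function name `read_parallel_tsv` are assumptions for illustration, so check the layout of the file you actually download.

```python
import csv
import gzip
from typing import Iterator, Tuple


def read_parallel_tsv(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (left, right) sentence pairs from a gzipped two-column TSV file."""
    # Assumption: UTF-8 text, one pair per line, columns separated by a tab.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside sentences as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip rows that do not have exactly two columns rather than guess.
            if len(row) != 2:
                continue
            left, right = row[0].strip(), row[1].strip()
            # Drop pairs where either side is empty after trimming.
            if left and right:
                yield left, right
```

Calling `list(read_parallel_tsv("wikititles-v1.gu-en.tsv.gz"))` on a downloaded copy would give a list of string pairs that can be split into source and target files for a translation toolkit.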

Gondi-X

  1. Gondi-Hindi Parallel Corpus

Hindi-X

  1. Samanantar Corpus
  2. Hi-En IITB Parallel Corpus: v3.0 released
  3. Hi-En Project Anuvaad
  4. Hi-En Indian Parallel Corpora
  5. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  6. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  7. Hi-En Asian Language Treebank (ALT) Parallel Corpus
  8. Hi-En PMIndia Corpus
  9. Hi-En Bible Corpus
  10. Hi-En Wiki Matrix Comparable Corpus
  11. Hi-En OPUS: Set source as en and target as hi. [Some of these corpora are part of the IITB Parallel Corpus.]
  12. Hi-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row hin-eng.
  13. IIITH Code-Mix Hi-En Corpus
  14. Hi-En Flickr 8k: Multimodal Dataset
  15. Hi-San parallel corpus: Hindi-Sanskrit monolingual and parallel data from Ramayana, Rigveda, Bhagavad Gita, etc.

Kannada-X

  1. Samanantar Corpus
  2. Kn-En Project Anuvaad
  3. Kn-En PMIndia Corpus
  4. Kn-En Bible Corpus
  5. Kn-En OPUS: Set source as en and target as kn
  6. Kn-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row kan-eng.

Manipuri-X

  1. Mn-En PMIndia Corpus

Marathi-X

  1. Samanantar Corpus
  2. Mr-En Project Anuvaad
  3. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  4. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  5. Mr-En PMIndia Corpus
  6. Mr-En Bible Corpus
  7. Mr-En OPUS: Set source as en and target as mr
  8. Mr-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row mar-eng.

Malayalam-X

  1. Samanantar Corpus
  2. Ml-En Project Anuvaad
  3. Ml-En Indian Parallel Corpora
  4. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  5. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  6. Ml-En Indian-Language Dataset
  7. Ml-En English_Malayalam_ParallelCorpora
  8. Ml-En PMIndia Corpus
  9. Ml-En Bible Corpus
  10. Ml-En OPUS: Set source as en and target as ml
  11. Ml-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row mal-eng.

Oriya-X

  1. Samanantar Corpus
  2. Or-En MTEnglish2Odia
  3. Or-En OdiEnCorp 2.0
  4. Or-En OdiEnCorp 1.0
  5. Or-En IndoWordnet Parallel Corpus
  6. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, mr, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  7. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, mr, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  8. Or-En PMIndia Corpus
  9. Or-En OPUS: Set source as en and target as or
  10. Or-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row ori-eng.

Punjabi-X

  1. Samanantar Corpus
  2. Pu-En Project Anuvaad
  3. Pu-En Punjabi-English Corpus
  4. Pu-En PMIndia Corpus
  5. Pu-En OPUS: Set source as en and target as pa
  6. Pu-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row pan-eng.

Sanskrit-X

  1. San-Hi parallel corpus: Sanskrit-Hindi monolingual and parallel data from Ramayana, Rigveda, Bhagavad Gita, etc.

Tamil-X

  1. Samanantar Corpus
  2. Ta-En Project Anuvaad
  3. Ta-En Indian Parallel Corpora
  4. Ta-En National Language Process Center
  5. Ta-En EnTam
  6. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, mr, or, pa, te, ur. [Source-code, pretrained models and other resources also available.]
  7. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, mr, or, pa, te, ur. [Source-code, pretrained models and other resources also available.]
  8. Ta-En Indian-Language Dataset
  9. Ta-En Multiple Dataset Links
  10. Ta-En PMIndia Corpus
  11. Ta-En Parallel Corpus
  12. Ta-En OPUS: Set source as en and target as ta
  13. Ta-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row tam-eng.

Telugu-X

  1. Samanantar Corpus
  2. Te-En Project Anuvaad
  3. Te-En Indian Parallel Corpora
  4. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, mr, or, pa, ta, ur. [Source-code, pretrained models and other resources also available.]
  5. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, mr, or, pa, ta, ur. [Source-code, pretrained models and other resources also available.]
  6. Te-En Indian-Language Dataset
  7. Te-En PMIndia Corpus
  8. Te-En Bible Corpus
  9. Te-En OPUS: Set source as en and target as te
  10. Te-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row tel-eng.

Other Resources

  1. PMIndia Parallel Corpus Creation: Code for creating a parallel corpus from pmindia.gov.in. [Paper Link]
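
The OPUS and Backtranslated Tatoeba Challenge entries in the sections above identify each language by a short code: a two-letter target code for OPUS and a three-letter row label for Tatoeba. The sketch below collects only the codes quoted in this list into one lookup. The names `OPUS_TARGET`, `TATOEBA_ROW`, and `lookup_codes` are hypothetical helpers, not part of either project, and a language missing from a table simply has no such entry in this list.

```python
from typing import Dict, Optional

# Two-letter codes quoted in the OPUS entries ("Set source as en and target as ...").
OPUS_TARGET: Dict[str, str] = {
    "Bengali": "bn", "Gujarati": "gu", "Hindi": "hi", "Kannada": "kn",
    "Marathi": "mr", "Malayalam": "ml", "Oriya": "or", "Punjabi": "pa",
    "Tamil": "ta", "Telugu": "te",
}

# Row labels quoted in the Backtranslated Tatoeba Challenge entries.
TATOEBA_ROW: Dict[str, str] = {
    "Assamese": "asm-eng", "Bengali": "ben-eng", "Gujarati": "guj-eng",
    "Hindi": "hin-eng", "Kannada": "kan-eng", "Marathi": "mar-eng",
    "Malayalam": "mal-eng", "Oriya": "ori-eng", "Punjabi": "pan-eng",
    "Tamil": "tam-eng", "Telugu": "tel-eng",
}


def lookup_codes(language: str) -> Dict[str, Optional[str]]:
    """Return the OPUS target code and Tatoeba row listed for a language, if any."""
    # Normalise input such as " hindi " to the capitalised keys used above.
    key = language.strip().title()
    return {
        "opus_target": OPUS_TARGET.get(key),
        "tatoeba_row": TATOEBA_ROW.get(key),
    }
```

For example, `lookup_codes("bengali")` gives the `bn` target code and the `ben-eng` row, while a language such as Gondi, which has neither entry in this list, gives `None` for both.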