MGloder / vakyansh-models

Open source speech to text models for Indic Languages

Vakyansh Open Source Models

TTS Models (Repo)

The models below are trained using a combination of Glow-TTS and HiFi-GAN.

| Language | Gender | Glow-TTS checkpoint | HiFi-GAN checkpoint |
|---|---|---|---|
| Hindi | Female | voice_0_glow | voice_0_hifi |
| Hindi | Male | voice_1_glow | voice_1_hifi |

Pretrained ASR Models

| Pretrained Model | Description | Architecture | Pretrained Hours |
|---|---|---|---|
| CLSRIL-23 | Cross Lingual Speech Representations for Indic Languages. Pretrained on 10,000 hours of data from 23 Indic languages. Citation: https://arxiv.org/abs/2107.07402 | Base | 10,000 |
| hindi_pretrained_4kh | Trained on 4200 hours of Hindi data | Base | 4200 |
| kannada_pretrained_1400h | Trained on 1400 hours of Kannada data | XLSR | 1400 |

Finetuned ASR Models (works on v2-hydra branch*)

| Language | Pretrained Model | Finetuned Model | Dictionary | Single Model for Inference | Finetuned Hours | TS Model |
|---|---|---|---|---|---|---|
| Hindi | CLSRIL-23 | him_4200 | dict | hindi_infer | 4200 h | hindi_ts |
| Indian English | CLSRIL-23 | enm_700 | dict | english_infer | 700 h | english_ts |
| Kannada | CLSRIL-23 | knm_560 | dict | kannada_infer | 560 h | kannada_ts |
| Tamil | CLSRIL-23 | tam_250 | dict | tamil_infer | 250 h | tamil_ts |
| Bengali | CLSRIL-23 | bnm_200 | dict | bengali_infer | 200 h | bengali_ts |
| Nepali | CLSRIL-23 | nem_130 | dict | nepali_infer | 130 h | nepali_ts |
| Telugu | CLSRIL-23 | tem_100 | dict | telugu_infer | 100 h | telugu_ts |
| Gujarati | CLSRIL-23 | gum_100 | dict | gujarati_infer | 100 h | gujarati_ts |
| Marathi | CLSRIL-23 | mrm_100 | dict | marathi_infer | 100 h | |
| Odia | CLSRIL-23 | orm_100 | dict | odia_infer | 100 h | |
| Sanskrit | CLSRIL-23 | sam_60 | dict | sanskrit_infer | 60 h | |
| Maithili | CLSRIL-23 | maim_50 | dict | maithili_infer | 50 h | |
| Urdu | CLSRIL-23 | urm_60h | dict | urdu_infer | 60 h | |
| Punjabi | CLSRIL-23 | pam_10h | dict | punjabi_infer | 10 h | |
| Dogri | CLSRIL-23 | doi_55h | dict | dogri_infer | 55 h | |
| Malayalam | CLSRIL-23 | mlm_8h | dict | malayalam_infer | 8 h | |
| Bhojpuri | CLSRIL-23 | bhom_60h | dict | bhojpuri_infer | 60 h | |
| Rajasthani | CLSRIL-23 | raj_45h | dict | rajasthani_infer | 45 h | |
| Assamese | CLSRIL-23 | asm_8h | dict | assamese_infer | 8 h | |

Language Models (Works with Finetuned ASR Models)

The text data is taken from the AI For Bharat corpus; we post-process it by tokenizing and removing duplicates.
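The post-processing mentioned above could look roughly like the following sketch. The repository does not specify which tokenizer or normalization was actually used, so everything here is an assumption for illustration: the function names `tokenize` and `preprocess` are hypothetical, tokenization is plain whitespace splitting, and Unicode NFC normalization is added so that visually identical Indic strings compare equal during de-duplication.

```python
import unicodedata
from typing import Iterable, List


def tokenize(line: str) -> str:
    """Return the line as space-separated tokens (assumed: whitespace tokenization)."""
    # NFC normalization makes equivalent Unicode sequences byte-identical,
    # which matters for scripts that use combining marks.
    normalized = unicodedata.normalize("NFC", line)
    # str.split() with no arguments splits on any run of whitespace
    # and drops leading/trailing whitespace.
    return " ".join(normalized.split())


def preprocess(lines: Iterable[str]) -> List[str]:
    """Tokenize each line and drop empty lines and exact duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for line in lines:
        sentence = tokenize(line)
        if sentence and sentence not in seen:
            seen.add(sentence)
            result.append(sentence)
    return result


if __name__ == "__main__":
    sample = ["नमस्ते  दुनिया", "नमस्ते दुनिया", "", "hello   world"]
    for sentence in preprocess(sample):
        print(sentence)
```

The result is one sentence per line with space-separated tokens, which is the kind of input n-gram language model toolkits such as KenLM are typically trained on.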

| Language | Type | Lexicon | LM |
|---|---|---|---|
| Hindi | kenlm 5-gram | hindi_lexicon | hindi_lm |
| Indian English | kenlm 5-gram | english_lexicon | english_lm |
| Kannada | kenlm 5-gram | kannada_lexicon | kannada_lm |
| Tamil | kenlm 5-gram | tamil_lexicon | tamil_lm |
| Bengali | kenlm 5-gram | bengali_lexicon | bengali_lm |
| Nepali | kenlm 5-gram | nepali_lexicon | nepali_lm |
| Telugu | kenlm 5-gram | telugu_lexicon | telugu_lm |
| Gujarati | kenlm 5-gram | gujarati_lexicon | gujarati_lm |
| Marathi | kenlm 5-gram | marathi_lexicon | marathi_lm |
| Odia | kenlm 5-gram | odia_lexicon | odia_lm |
| Sanskrit | kenlm 5-gram | sanskrit_lexicon | sanskrit_lm |
| Maithili | kenlm 5-gram | maithili_lexicon | maithili_lm |
| Urdu | kenlm 5-gram | urdu_lexicon | urdu_lm |
| Punjabi | kenlm 5-gram | punjabi_lexicon | punjabi_lm |
| Dogri | kenlm 5-gram | dogri_lexicon | dogri_lm |
| Malayalam | kenlm 5-gram | malayalam_lexicon | malayalam_lm |
| Bhojpuri | kenlm 5-gram | bhojpuri_lexicon | bhojpuri_lm |
| Rajasthani | kenlm 5-gram | rajasthani_lexicon | rajasthani_lm |
| Assamese | kenlm 5-gram | assamese_lexicon | assamese_lm |

Domain Specific Language Models

| Language | Type | Domain | Lexicon | LM |
|---|---|---|---|---|
| English | kenlm 5-gram | Biomedical | bio_lexicon | bio_lm |


Citation

If you use any of our resources, please cite the following article.

@misc{gupta2021clsril23,
      title={CLSRIL-23: Cross Lingual Speech Representations for Indic Languages}, 
      author={Anirudh Gupta and Harveen Singh Chadha and Priyanshi Shah and Neeraj Chimmwal and Ankur Dhuriya and Rishabh Gaur and Vivek Raghavan},
      year={2021},
      eprint={2107.07402},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License: MIT License