There are 6 repositories under the low-resource-languages topic.
Resources for conservation, development, and documentation of low-resource (human) languages.
This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
This repository contains the code and data of the paper titled "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation" published in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 16 - November 20, 2020.
A repository for publicly/freely available Natural Language Processing (NLP) datasets for African languages.
Open-source benchmark datasets and pretrained transformer models in the Filipino language.
Python source code for EMNLP 2020 paper "Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT".
Speech synthesis (TTS) in low-resource languages by training from scratch with Fastpitch and fine-tuning with HifiGan
NLP pipelines for Tagalog using spaCy
Exploring the Limits of Low-Resource Neural Machine Translation
This is an ASR corpus for the Bemba language. It contains read speech from diverse publicly available Bemba sources: literature books, radio/TV show transcripts, YouTube video transcripts, and online sources. The corpus has 14,438 utterances totaling over 24 hours of speech.
Curated list of publicly available parallel corpora for Indian languages
Fine-tuning for automatic speech recognition on low-resource languages with character-based CTC model
This is a repository for NaijaSenti, a Lacuna-funded project for the development of a sentiment corpus for four Nigerian languages: Igbo, Hausa, Yoruba, and Pidgin.
📖 LanMIT: A Toolkit for Improving Language Models in Low-resourced Speech Recognition based on Kaldi.
My thesis on "Open Source Code and Low Resource Languages" for an MSc in Language Science and Technology at Saarland University
A pipeline to isolate and transcribe one language in mixed-language speech
Confidence Estimation for Black Box Automatic Speech Recognition Systems Using Lattice Recurrent Neural Networks https://arxiv.org/abs/1910.11933 or https://ieeexplore.ieee.org/document/9053264
Fake news detection in Filipino via Multitask Transfer Learning
Repository for multilingual speech data resources for native languages of Zambia.
Automated Speech Recognition for Chichewa.
Workflow for forced alignment between languages
MC^2: A Multilingual Corpus of Minority Languages in China (Tibetan, Uyghur, Kazakh, and Mongolian)
This is a repository for the IGBONLP Project.
Transfer learning for ASR with subword encoding CTC model (NVIDIA NeMo Citrinet) on low-resource languages
[ACL 2021, Findings] Cognate Prediction Per Machine Translation
Low resource machine translation using Transformers and Iterative Back translation
Enhanced awesome-align for low-resource languages and noise simulation: https://arxiv.org/abs/2301.09685
[AAAI 2021] - Simple or Complex? Learning to Predict Readability of Bengali Texts.
Graph Convolutional Network for Swahili News Classification: https://arxiv.org/abs/2103.09325
A set of frameworks for creating the AI/ML building blocks for low-resource languages.