There are 7 repositories under the low-resource-languages topic.
Resources for conservation, development, and documentation of low-resource (human) languages.
This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
This repository contains the code and data of the paper titled "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation" published in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 16 - November 20, 2020.
A repository for publicly/freely available Natural Language Processing (NLP) datasets for African languages.
GlotLID: Language Identification with Support for More Than 2000 Labels -- EMNLP 2023
Open-source benchmark datasets and pretrained transformer models in the Filipino language.
NLP pipelines for Tagalog using spaCy
Speech synthesis (TTS) in low-resource languages by training from scratch with FastPitch and fine-tuning with HiFi-GAN
Python source code for the EMNLP 2020 paper "Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT".
Exploring the Limits of Low-Resource Neural Machine Translation
This is an ASR corpus for the Bemba language. It contains read speech from diverse publicly available Bemba sources: literature books, radio/TV show transcripts, YouTube video transcripts, and online sources. The corpus has 14,438 utterances totaling over 24 hours of speech.
Curated list of publicly available parallel corpora for Indian languages
This is a repository for NaijaSenti, a Lacuna-funded project for the development of a sentiment corpus for four Nigerian languages: Igbo, Hausa, Yoruba, and Pidgin.
📖 LanMIT: A Toolkit for Improving Language Models in Low-resourced Speech Recognition based on Kaldi.
My thesis on "Open Source Code and Low Resource Languages" for an MSc in Language Science and Technology at Saarland University
A pipeline to isolate and transcribe one language in mixed-language speech
Confidence Estimation for Black Box Automatic Speech Recognition Systems Using Lattice Recurrent Neural Networks (https://arxiv.org/abs/1910.11933 or https://ieeexplore.ieee.org/document/9053264)
Workflow for forced alignment between languages
Fake news detection in Filipino via Multitask Transfer Learning
Official implementation of "CONCRETE: Improving Cross-lingual Fact Checking with Cross-lingual Retrieval" (COLING'22)
Repository for multilingual speech data resources for native languages of Zambia.
Automated Speech Recognition for Chichewa.
The EveryVoice TTS Toolkit - Text To Speech for your language
[ACL'24] MC^2: A Multilingual Corpus of Minority Languages in China (Tibetan, Uyghur, Kazakh, and Mongolian)
[ACL'24 Findings] Teaching Large Language Models an Unseen Language on the Fly
This is a repository for the IGBONLP Project.
Generate synthetic labeled data for extremely low-resource languages using bilingual lexicons.
[ACL 2021, Findings] Cognate Prediction Per Machine Translation
Low-resource machine translation using Transformers and iterative back-translation
[AAAI 2021] - Simple or Complex? Learning to Predict Readability of Bengali Texts.
A set of frameworks for creating the AI/ML building blocks for low-resource languages.
Enhanced awesome-align for low-resource languages and noise simulation: https://arxiv.org/abs/2301.09685