low-resource-languages

There are 7 repositories under the low-resource-languages topic.

RichardLitt / low-resource-languages
Resources for conservation, development, and documentation of low-resource (human) languages.
endangered-languages natural-language language-resources human-language natural-language-processing language-learning language-documentation resourced-languages awesome-list awesome list minority-language low-resource-languages lrls nlp
Language:TeX 380
csebuetnlp / xl-sum
This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
abstractive-summarization abstractive-text-summarization dataset deep-learning low-resource-languages low-resource-summarization low-resource-text-summarizarion machine-learning multilingual multilingual-summarization multilingual-text-summarization multilinguality summarization-corpora summarization-dataset text-summarisation text-summarization text-summarization-dataset text-summarization-model
Language:Python 249
csebuetnlp / banglanmt
This repository contains the code and data of the paper titled "Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation" published in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 16 - November 20, 2020.
bangla-nlp machine-translation parallel-corpus parallel-corpora neural-machine-translation bangla-dataset-machine-translation bangla-machine-translation low-resource-languages emnlp-2020 low-resource-nlp low-resource-machine-translation
Language:Python 143
Andrews2017 / africanlp-public-datasets
A repository for publicly/freely available Natural Language Processing (NLP) datasets for African languages.
african-languages datasets low-resource-languages natural-language-processing
84
cisnlp / GlotLID
GlotLID: Language Identification with Support for More Than 2000 Labels -- EMNLP 2023
language-detection language-detection-lib language-detection-library language-detector language-identification language-identification-toolkit language-identifier langid lid low-resource-languages low-resource-nlp multlingual language-classification language-recognition glot
Language:Python 79
jcblaisecruz02 / Filipino-Text-Benchmarks
Open-source benchmark datasets and pretrained transformer models in the Filipino language.
transformer transfer-learning bert tagalog filipino corpus benchmark deep-learning tagalog-transformers text-classification electra electra-models nli low-resource-languages
Language:Python 58
ljvmiranda921 / calamanCy
NLP pipelines for Tagalog using spaCy
computational-linguistics low-resource-languages low-resource-nlp machine-learning natural-language-processing ner nlp spacy
Language:Python 41
Rumeysakeskin / Turkish-Text-to-Speech
Speech synthesis (TTS) in low-resource languages by training from scratch with FastPitch and fine-tuning with HiFi-GAN
fastpitch hifigan low-resource-languages pytorch speech-synthesis tts phonetical-conversion turkish-text-to-speech nvidia-docker spectrogram-generator nvidia-nemo waveform-generator
Language:Python 41
kbatsuren / CogNet
CogNet: a large-scale, high-quality cognate database for 338 languages, 1.07M words, and 8.1 million cognates
cognate multilinguality cross-lingual-simialrity wordnet language-resources low-resource-languages corpus-linguistics bilingual-lexicon-extraction bilingual-lexicon-induction cross-lingual-transfer
40
alexandra-chron / relm_unmt
Python source code for EMNLP 2020 paper "Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT".
transfer-learning language-models cross-lingual pretraining low-resource-languages unsupervised-machine-translation residual-adapters
Language:Python 35
cdli-gh / Semi-Supervised-NMT-for-Sumerian-English
Exploring the Limits of Low-Resource Neural Machine Translation
low-resource-languages nmt unsupervised semi-supervised backtranslation transformers xlm translation
Language:Jupyter Notebook 32
csikasote / BembaSpeech
This is an ASR corpus for the Bemba language. It contains read speech from diverse publicly available Bemba sources: literature books, radio/TV show transcripts, YouTube video transcripts, and online sources. The corpus has 14,438 utterances totaling over 24 hours of speech.
automatic-speech-recognition low-resource-languages bemba
31
Kartikaggarwal98 / Indian_ParallelCorpus
Curated list of publicly available parallel corpora for Indian languages
machinetranslation indian-languages nlp corpus parallel-corpus parallel-corpora multilingual-translation low-resource-machine-translation low-resource-languages neural-machine-translation
29
hausanlp / NaijaSenti
This is a repository for NaijaSenti, a Lacuna-funded project for the development of a sentiment corpus for four Nigerian languages: Igbo, Hausa, Yoruba, and Pidgin.
hausa hausanlp nlp sentiment-analysis dataset low-resource-nlp hausa-nlp igbo igbo-language low-resource-languages nigeria nigerian-data yoruba yorubaname-dictionary african-languages sentiment sentiment-classification sentiment-data
25
charlesliucn / LanMIT
📖 LanMIT: A Toolkit for Improving Language Models in Low-resourced Speech Recognition based on Kaldi.
kaldi-asr keyword-spotting language-modeling low-resource-languages speech-recognition speech-to-text
Language:C++ 21
RichardLitt / thesis
My thesis on "Open Source Code and Low Resource Languages" for an MSc in Language Science and Technology at Saarland University
thesis dissertation endangered-languages saarland saarland-university lrl low-resource-languages nlp nlproc
Language:TeX 20
CoEDL / vad-sli-asr
A pipeline to isolate and transcribe one language in mixed-language speech
automatic-speech-recognition endangered-languages low-resource-languages spoken-language-identification voice-activity-detection
Language:Python 18
Aditi138 / EntityTargetedActiveLearning
nlp low-resource-languages named-entity-recognition transfer-learning active-learning
Language:Python 17
alecokas / BiLatticeRNN-Confidence
Confidence Estimation for Black Box Automatic Speech Recognition Systems Using Lattice Recurrent Neural Networks https://arxiv.org/abs/1910.11933 or https://ieeexplore.ieee.org/document/9053264
attention lstm speech-processing confidence-scores confidence-estimation pytorch confusion-networks low-resource-languages latticernn lattice lattices pytorch-implementation confidence-estimates speech-recognition asr
Language:Python 16
jhdeov / interlingual-MFA
Workflow for forced alignment between languages
forced-alignment low-resource-languages montreal-forced-aligner multilingual-alignment cross-language cross-language-alignment
Language:Python 15
jcblaisecruz02 / Tagalog-fake-news
Fake news detection in Filipino via Multitask Transfer Learning
bert bert-model tagalog filipino low-resource-languages transformer nlp deep-learning multitask-learning
14
khuangaf / CONCRETE
Official implementation of "CONCRETE: Improving Cross-lingual Fact Checking with Cross-lingual Retrieval" (COLING'22)
cross-lingual-transfer fact-checking retrieval low-resource-languages multilinguality
Language:Python 14
surafelml / Afro-NMT
Low-Resource Neural Machine Translation: A Benchmark for Five African Languages
neural-machine-translation low-resource-languages multilingaul transfer-learning
Language:Shell 14
unza-speech-lab / zambezi-voice
Repository for multilingual speech data resources for native languages of Zambia.
low-resource-languages speech-recognition speech-to-text zambia
14
dmatekenya / Chichewa-Speech2Text
Automated Speech Recognition for Chichewa.
asr chichewa low-resource-languages nlp
Language:Jupyter Notebook 13
EveryVoiceTTS / EveryVoice
The EveryVoice TTS Toolkit - Text To Speech for your language
language-revitalization low-resource-languages python pytorch pytorch-lightning speech speech-processing speech-synthesis text-to-speech tts
Language:Python 13
luciusssss / mc2_corpus
[ACL'24] MC^2: A Multilingual Corpus of Minority Languages in China (Tibetan, Uyghur, Kazakh, and Mongolian)
corpus kazakh low-resource-languages low-resource-nlp mongolian multilingual natural-language-processing tibetan uyghur tibetan-nlp
Language:Python 13
luciusssss / ZhuangBench
[ACL'24 Findings] Teaching Large Language Models an Unseen Language on the Fly
large-language-models llm low-resource-languages low-resource-nlp zhuang
Language:Python 13
IgnatiusEzeani / IGBONLP
This is a repository for the IGBONLP Project.
igbo-language nlp machine-translation deep-learning low-resource-languages
Language:Modula-3 11
BatsResearch / LexC-Gen
Generate synthetic labeled data for extremely low-resource languages using bilingual lexicons.
lexicon-based llm low-resource-languages sentiment-analysis synthetic-data synthetic-dataset-generation topic-modeling multilingual multilingual-nlp
Language:Python 10
clefourrier / CopperMT
[ACL 2021, Findings] Cognate Prediction Per Machine Translation
acl2021 cognate-prediction cognates nmt smt machine-translation low-resource-languages low-resource-machine-translation fairseq
Language:JavaScript 10
fajri91 / minangNLP
Minangkabau NLP corpus. PACLIC 2020
sentiment-analysis machine-translation corpus minangkabau-language nlp bert low-resource-languages indonesian-language ethnicity paclic
Language:Python 10
harmanpreet93 / low-resource-machine-translation
Low resource machine translation using Transformers and Iterative Back translation
nmt low-resource-languages machine-translation transformer-models bert-embeddings back-translation nlp
Language:Python 10
tafseer-nayeem / BengaliReadability
[AAAI 2021] - Simple or Complex? Learning to Predict Readability of Bengali Texts.
bengali-nlp bengali-natural-language-processing bengali-language-processing bengali-readability bengali-readability-analysis bengali-readability-prediction readability-dataset bengali-readability-dataset low-resource-languages
Language:Python 9
ofdn / OpenSpeaks-Before-AI
A set of frameworks for creating the AI/ML building blocks for low-resource languages.
ai languages low-resource-languages ml
8
ruoyuxie / noisy_parallel_data_alignment
Enhanced awesome-align for low-resource languages and noise simulation: https://arxiv.org/abs/2301.09685
low-resource-languages low-resource-nlp noise noisy-data nueral-machine-translation ocr ocr-text word-aligner word-alignment
Language:Python 8
