dadelani / africanlp-public-datasets

A repository for publicly/freely available Natural Language Processing (NLP) datasets for African languages.

AfricaNLP-Public-Datasets

Datasets per task (in no particular order)

Machine Translation

  • JW300: A parallel text dataset of 417 languages, including 101 African languages.

  • TANZIL: Translations of the Quran into 45 languages, including African languages such as Amharic, Hausa, Somali, and Swahili.

  • MENYO-20k: A Yorùbá-English multi-domain parallel text dataset.

  • FFR: A Fon-French parallel text dataset.

  • Hausa Corpus: A Hausa-English parallel text dataset.

  • CCAligned: A parallel text dataset pairing English with 137 languages, including 30 African languages.

  • ParaCrawl: A parallel text dataset for 41 languages, including Somali and Swahili.

  • WikiMatrix: A parallel text dataset for 85 languages, including Swahili, Malagasy, and Egyptian Arabic.

  • Ethiopian MT datasets: A parallel text dataset for English paired with 7 Ethiopian languages.

  • English-Luganda: An English-Luganda parallel text dataset.

  • French-Fon and French-Ewe: A parallel text dataset for French paired with Fon and Ewe.

  • Amharic-English: An Amharic-English parallel text dataset.

  • Tigrinya-English: A Tigrinya-English parallel text dataset (Free registration required).

  • Lingala-French: A Lingala-French parallel text dataset (Free registration required).

  • Congolese Swahili-French (Min, Small, Medium): Congolese Swahili-French parallel text datasets (Free registration required).

  • Swahili-French: A synthetic Swahili-French parallel text dataset (Free registration required).

  • English-Hausa (Min, Small): English-Hausa parallel text datasets (Free registration required).

  • English-Swahili: An English-Swahili parallel text dataset (Free registration required).

  • English-Kanuri: An English-Kanuri parallel text dataset (Free registration required).

  • English-Akuapem Twi: An English-Akuapem Twi parallel text dataset.

  • FLORES-101: A parallel text dataset for 101 languages, including 18 African languages.
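
Parallel corpora like those above are often distributed as two line-aligned plain-text files, one per language, where line i of one file is the translation of line i of the other. That layout is an assumption here, not something every dataset in this list guarantees, so check each dataset's own documentation first. Under that assumption, a minimal sketch of pairing the two files (the helper name `read_parallel` and any file names are hypothetical):

```python
from itertools import zip_longest
from pathlib import Path
from typing import Iterator, Tuple


def read_parallel(src_path: Path, tgt_path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned files.

    Hypothetical helper: assumes UTF-8 text with one sentence per line.
    Raises ValueError if the two files have different numbers of lines.
    Pairs in which either side is empty after stripping are skipped.
    """
    with src_path.open(encoding="utf-8") as src, tgt_path.open(encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, which exposes misalignment
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            s, t = s.strip(), t.strip()
            if s and t:
                yield s, t
```

Failing loudly on a length mismatch is a deliberate choice: silently truncating to the shorter file would hide a misaligned corpus, which is hard to detect later in training.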

Text Classification

Sentiment Analysis

  • TUNIZI: A Tunisian Arabizi sentiment analysis dataset.

Text Summarization

Named Entity Recognition

  • MasakhaNER: A Named Entity Recognition dataset for 10 African languages.

  • WikiANN: A dataset for Named Entity Recognition for 282 languages, including several African languages.

  • Yoruba GV NER: A Yoruba Named Entity Recognition dataset.

  • Hausa VOA NER: A Hausa Named Entity Recognition dataset.
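
NER corpora are commonly released in a CoNLL-style layout: one token and its tag per line, separated by whitespace, with a blank line between sentences. Assuming that layout (verify it against each dataset's actual files), a reader could be sketched as follows. The function `parse_conll` and the sample text are illustrative only and are not taken from any dataset listed here.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs


def parse_conll(text: str) -> List[Sentence]:
    """Split CoNLL-style text into sentences of (token, tag) pairs.

    Assumes the token is the first column and the tag is the last column.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        if len(parts) < 2:
            raise ValueError(f"expected 'token tag', got: {line!r}")
        current.append((parts[0], parts[-1]))
    if current:
        # The last sentence may not be followed by a blank line
        sentences.append(current)
    return sentences


# Made-up sample in the assumed format
sample = """\
Lagos B-LOC
is O
busy O

Amina B-PER
left O
"""

sentences = parse_conll(sample)
```

Here `sentences` holds two sentences, the first with three (token, tag) pairs and the second with two.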

Automatic Speech Recognition (ASR)

  • ALFFA: An ASR dataset for Amharic, Hausa, Swahili, and Wolof.

  • AMMI ASR dataset: An ASR dataset for 19 languages, including 16 African languages.

  • CommonVoice: An ongoing ASR dataset project covering 60 languages (as of May 2021), including Kinyarwanda, Kabyle, and Luganda.

  • Fon: An ASR dataset for Fon.

  • Swahili: A Swahili speech dataset (Free registration required).

  • Congolese Swahili: A Congolese Swahili speech dataset (Free registration required).

Speech Translation

Monolingual Data

Contributions

This is a growing list of NLP datasets for African languages. If a publicly available dataset is missing, please open a pull request or email niyongabor.andre@gmail.com to have it added.
