Massawe14 / 7TH-PLACE-SOLUTION-Lacuna-Masakhane-Parts-of-Speech-Classification-Challenge

7TH-PLACE-SOLUTION-Lacuna-Masakhane-Parts-of-Speech-Classification-Challenge

Description:

Part-of-speech (POS) tagging is a crucial step in natural language processing (NLP), as it allows algorithms to understand the grammatical structure and meaning of a text. This is especially important in creating the building blocks for preparing low-resource African languages for NLP tasks. The MasakhaPOS dataset for 20 typologically diverse African languages, including benchmarks, was created with the help of Lacuna Fund to help address this problem.

The objective of this challenge is to create a machine learning solution that correctly classifies 14 parts of speech for the unrelated Luo and Setswana languages. You will need to build one solution that applies to both languages, not two solutions, one for each language.

It is important that only one solution be built for both languages, as this is a step towards a solution that can be applied to many different languages instead of requiring a separate model for each language.

This challenge is also important for Lacuna Fund, to help reach their goal of making ML-ready datasets available from low- and middle-income contexts. Masakhane is a grassroots organisation whose mission is to strengthen and spur NLP research in African languages, for Africans, by Africans. Despite the fact that 2000 of the world’s languages are African, African languages are barely represented in technology. The tragic past of colonialism has been devastating for African languages in terms of their support, preservation and integration. This has resulted in a technological space that does not understand our names, our cultures, our places, our history.

Masakhane roughly translates to “We build together” in isiZulu. Our goal is for Africans to shape and own these technological advances towards human dignity, well-being and equity, through inclusive community building, open participatory research and multidisciplinarity.

About the Datasets:

The training set of 19 languages is available at this repo: https://github.com/masakhane-io/masakhane-pos

Use this starter notebook to get started: https://github.com/masakhane-io/masakhane-pos/blob/main/train_pos.ipynb

The test set contains 17 parts of speech from Luo and 17 parts of speech from Setswana. Both these languages are unseen in the training set.

You can read more about the dataset and some ideas that have worked in the past in this paper (https://arxiv.org/pdf/2305.13989.pdf). However, you are encouraged to come up with your own methods.
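
Because a single model has to cover both unseen languages, one natural starting point is to pool the training sentences from all available languages into one dataset with a shared tag inventory. The snippet below is only a minimal sketch of that idea, not the method used in this repository. It assumes the data is in a CoNLL-style layout (one `token TAG` pair per line, a blank line between sentences); the helper `read_conll`, the language keys `lang_a` and `lang_b`, and the toy tokens are hypothetical placeholders, so check the actual file layout in the masakhane-pos repo before relying on it.

```python
from collections import Counter


def read_conll(text):
    """Parse CoNLL-style text into a list of (tokens, tags) sentence pairs.

    Assumed format (not confirmed by this README): one "token TAG" pair
    per line, with a blank line separating sentences.
    """
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        # The tag is taken to be the last whitespace-separated field.
        token, tag = line.rsplit(maxsplit=1)
        tokens.append(token)
        tags.append(tag)
    if tokens:
        sentences.append((tokens, tags))
    return sentences


# Toy placeholder data standing in for per-language training files.
sample_by_language = {
    "lang_a": "w1 NOUN\nw2 VERB\n. PUNCT\n\nw3 PRON\nw4 VERB\n",
    "lang_b": "w5 DET\nw6 NOUN\n",
}

# Pool every language into one training set so that one model sees them all.
pooled = []
for lang, text in sample_by_language.items():
    pooled.extend(read_conll(text))

# Build one shared label inventory across languages.
label_list = sorted({tag for _, tags in pooled for tag in tags})
label2id = {label: i for i, label in enumerate(label_list)}
tag_counts = Counter(tag for _, tags in pooled for tag in tags)
```

The shared `label2id` mapping is what lets one classifier head serve every language; `tag_counts` gives a quick view of tag imbalance in the pooled data.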
