franciellevargas/HausaHate

benchmark corpus dataset hate-speech hausa-nlp low-resource-languages machine-learning natural-language-processing nlp-machine-learning offensive-language

HausaHate: A Benchmark Dataset for Hausa Hate Speech Detection

In African countries, hate speech is an especially serious problem because of a history of ethnic conflict. In particular, West Africa still lacks research on hate speech in its indigenous languages. Moreover, since most existing hate speech data resources were developed for English, hate speech technologies for African indigenous languages remain underdeveloped. To fill this gap, we introduce the first expert-annotated corpus of Facebook comments for Hausa hate speech detection. The corpus, titled HausaHate, comprises 2,000 comments extracted from West African Facebook pages and manually annotated by three native Hausa speakers who are also NLP experts. The corpus was annotated in two layers. First, each comment was labeled according to a binary classification: offensive versus non-offensive. Then, offensive comments were further labeled according to hate speech target: race, gender, or none. Lastly, we present a baseline model using a fine-tuned LLM for Hausa hate speech detection, highlighting the challenges of hate speech detection for indigenous languages in Africa as well as directions for future work. The following tables show the number of comments in each HausaHate category:

Offensive	Non-Offensive	Total Comments
678	1,322	2,000

Race	Gender	Non-Target	Total
391	65	222	678
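As a quick sanity check on the two tables above, the following minimal sketch recomputes the class shares from the reported counts and confirms that the target-layer total matches the number of offensive comments. The counts are copied from the tables; the dictionary keys and the `proportions` helper are illustrative names and are not part of the released corpus or its tooling.

```python
# Counts copied from the tables above; key names are illustrative only.
binary_counts = {"offensive": 678, "non_offensive": 1322}
target_counts = {"race": 391, "gender": 65, "non_target": 222}


def proportions(counts):
    """Return each label's share of the total, rounded to three decimals."""
    total = sum(counts.values())
    return {label: round(n / total, 3) for label, n in counts.items()}


# The two annotation layers should be consistent with each other:
# every offensive comment receives exactly one target label.
assert sum(binary_counts.values()) == 2000
assert sum(target_counts.values()) == binary_counts["offensive"]

binary_share = proportions(binary_counts)
target_share = proportions(target_counts)

print(binary_share)
print(target_share)
```

The shares make the class imbalance explicit: roughly one third of the comments are offensive, and the gender target is by far the smallest class, which may be worth accounting for when training or evaluating a classifier on this data.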

The following is the list of collaborators and authors of this project:

ETHICS STATEMENT

We followed the data anonymization steps described in Section 4.2.3 of the paper, as is standard for papers with this kind of data. A public corpus of anonymized Facebook comments is available. However, since the last change to the Meta platform terms of service was in 2020, we decided to disclose the ids of the comments only upon request, in order to allow reproducibility while also requiring researchers to pass through Meta’s authorization procedures to access the full data. Note that, to preserve anonymization, we publicly provide the comments without their ids and links. Please contact francielleavargas@usp.br to request the corpus with the ids and links of the comments.

CITING

Vargas, F., Guimarães, S., Muhammad, H. S., Alves, D., Ahmad, I. S., Abdulmumin, I., Mohamed, D., Pardo, T.A.S., Benevenuto, F. (2024). HausaHate: An Expert Annotated Corpus for Hausa Hate Speech Detection. Proceedings of the 8th Workshop on Online Abuse and Harms (NAACL 2024). pp.1--7. Mexico City, Mexico. Association for Computational Linguistics (ACL).

BIBTEX

FUNDING

About

HausaHate is a benchmark dataset for the Hausa hate speech detection task. It was extracted from West African Facebook pages and comprises 2,000 comments annotated according to a binary class (offensive and non-offensive) and hate speech targets (race, gender and none).
