IndicAbusive

Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages

Solving the problem of abusive speech detection in 8 languages (10 variants, counting code-mixed forms) from 14 publicly available sources.

New update -- 🎉 🎉 all our BERT models are available here. Be sure to check them out 🎉 🎉.

Please cite our paper in any published work that uses any of these resources.

@article{das2022data,
  title={Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages},
  author={Das, Mithun and Banerjee, Somnath and Mukherjee, Animesh},
  journal={arXiv preprint arXiv:2204.12543},
  year={2022}
}

Folder Description πŸ‘ˆ


./Dataset   --> Contains the dataset-related details.
./Codes     --> Contains the code.

Requirements

Make sure to use Python3 when running the scripts. The package requirements can be installed by running pip install -r requirements.txt.


Dataset

Check out the Dataset folder to know more about how we curated the dataset for different languages. ⚠️ A few datasets require crawling; hence we cannot guarantee the retrieval of all the data points, as tweets may get deleted. ⚠️


Models used for our task

  1. m-BERT is pre-trained on the 104 languages with the largest Wikipedias using a masked language modeling (MLM) objective. It is a stack of transformer encoder layers with 12 "attention heads", i.e., fully connected neural networks augmented with a self-attention mechanism. m-BERT is restricted in the number of tokens it can handle (512 at most). To fine-tune m-BERT, we add a fully connected layer on top of the output corresponding to the CLS token in the input; this CLS token output usually holds the representation of the sentence passed to the model (see the sketch after this list). The m-BERT model has been well studied for abusive speech detection, has already surpassed existing baselines, and stands as state-of-the-art.

  2. MuRIL stands for Multilingual Representations for Indian Languages and aims to improve interoperability from one language to another. This model uses a BERT base architecture pretrained from scratch utilizing the Wikipedia, Common Crawl, PMINDIA, and Dakshina corpora for 17 Indian languages and their transliterated counterparts.
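
The snippet below is a minimal sketch of the fine-tuning setup described above (a fully connected classification layer over the [CLS] token output), using the Hugging Face transformers library. The model names, label set, and hyperparameters are illustrative assumptions, not the repository's actual training configuration; see ./Codes for the real scripts.

```python
# Minimal sketch: fine-tuning m-BERT (or MuRIL) for abusive-speech classification
# with a fully connected layer over the [CLS] representation.
# Model names, labels, and hyperparameters here are illustrative only.
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "bert-base-multilingual-cased"  # or "google/muril-base-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

class AbusiveClassifier(nn.Module):
    def __init__(self, encoder, num_labels=2):
        super().__init__()
        self.encoder = encoder
        # Fully connected layer on top of the [CLS] token output.
        self.classifier = nn.Linear(encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0]  # [CLS] token representation
        return self.classifier(cls_output)

model = AbusiveClassifier(encoder)

# One toy training step; in practice, iterate over batches of the dataset.
batch = tokenizer(
    ["example post 1", "example post 2"],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
labels = torch.tensor([0, 1])  # 0 = normal, 1 = abusive (illustrative)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```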

Links to the individual models 👼 (see the loading sketch after the list)

  1. Bengali
  2. Hindi
  3. Hindi-CodeMixed
  4. Kannada-CodeMixed
  5. Malayalam-CodeMixed
  6. Marathi
  7. Tamil-CodeMixed
  8. Urdu
  9. Urdu-CodeMixed
  10. English
  11. AllInOne
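
To try one of the released checkpoints directly, a sketch along the following lines should work with the transformers library; the model identifier below is a placeholder assumption and must be replaced with the actual model card name linked above.

```python
# Illustrative sketch of running one of the released models for inference.
# The model identifier is a hypothetical placeholder: substitute the actual
# model card name from the links above.
from transformers import pipeline

model_id = "Hate-speech-CNERG/<model-name>"  # placeholder, not a real identifier

classifier = pipeline("text-classification", model=model_id)
print(classifier("an example post to classify"))
```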

For more details about our paper

Mithun Das, Somnath Banerjee, and Animesh Mukherjee. 2022. "Data Bootstrapping Approaches to Improve Low Resource Abusive Language Detection for Indic Languages". In Proceedings of ACM HT '22.

About

IndicAbusive

License: MIT License


Languages

Language: Python 100.0%