ShashwatVv / DE-LIMIT

DeEpLearning models for MultIlingual haTespeech (DELIMIT): Benchmarking multilingual models across 9 languages and 16 datasets.


Deep Learning Models for Multilingual Hate Speech Detection

🇵🇹 🇸🇦 🇵🇱 🇮🇩 🇮🇹 Solving the problem of hate speech detection in 9 languages across 16 datasets. :fr: :us: :es: :de:

New update -- 🎉 🎉 all our BERT models are available here. Be sure to check them out 🎉 🎉.

Demo

Please look here to see how to load the models and run inference.
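In essence, the demo loads a released checkpoint with the 🤗 Transformers library and runs a forward pass. Below is a minimal sketch (not the repo's official demo script), assuming the checkpoint name `Hate-speech-CNERG/dehatebert-mono-english` (one of the released monolingual models) and a `[normal, hateful]` label order — verify both against the model card:

```python
LABELS = ["normal", "hateful"]  # assumed order; check the model's config.json

def classify(text, model_name="Hate-speech-CNERG/dehatebert-mono-english"):
    """Return (label, class probabilities) for a single input sentence."""
    # Imports are deferred so this module loads even without transformers installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze().tolist()
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs
```

Calling `classify("some sentence")` downloads the checkpoint on first use and returns the predicted label with its probabilities.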

Please cite our paper in any published work that uses any of these resources.

```bibtex
@inproceedings{aluru2021deep,
  title={A Deep Dive into Multilingual Hate Speech Classification},
  author={Aluru, Sai Saketh and Mathew, Binny and Saha, Punyajoy and Mukherjee, Animesh},
  booktitle={Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track: European Conference, ECML PKDD 2020, Ghent, Belgium, September 14--18, 2020, Proceedings, Part V},
  pages={423--439},
  year={2021},
  organization={Springer International Publishing}
}
```

Folder Description 👈


./Dataset             --> Contains the dataset related files.
./BERT_Classifier     --> Contains the code for BERT classifiers performing binary classification on the dataset
./CNN_GRU             --> Contains the code for the CNN-GRU model
./LASER+LR            --> Contains the code for the logistic regression classifier used on top of LASER embeddings

Requirements

Make sure to use Python 3 when running the scripts. The package requirements can be installed by running `pip install -r requirements.txt`.


Dataset

Check out the Dataset folder to learn more about how we curated the dataset for the different languages. ⚠️ A few datasets require crawling, hence we cannot guarantee retrieval of all the data points, as tweets may get deleted. ⚠️


Models used for this task

We release the code for training/fine-tuning the following models, along with their hyperparameters.

🥇 best for high resource languages, 🏅 best for low resource languages

✈️ fastest to train, 🛩️ slowest to train

  1. mBERT Baseline: This setting uses the multilingual BERT model with the same language dataset for training and testing. Refer to the BERT_Classifier folder for the code and usage instructions.

  2. mBERT All_but_one 🥇 🛩️: This setting uses the multilingual BERT model with training data from multiple languages, and validation and test data from a single target language. Refer to the BERT_Classifier folder for the code and usage instructions.

  3. Translation + BERT Baseline: This setting translates the other language datasets to English and fine-tunes the bert-base model on these translated datasets. Refer to the BERT_Classifier folder for the code and usage instructions.

  4. CNN+GRU Baseline: This setting uses MUSE word embeddings along with a CNN-GRU based model, training and testing on the same language. Refer to the CNN_GRU folder for the code and usage instructions.

  5. LASER+LR Baseline ✈️: This setting trains a logistic regression model on the LASER embeddings of the dataset. The training and test datasets are from the same language. Refer to the LASER+LR folder for the code and usage instructions.

  6. LASER+LR All_but_one 🏅: This setting trains a logistic regression model on the LASER embeddings of the dataset. Datasets from the other languages are also used to train the LR model. Refer to the LASER+LR folder for the code and usage instructions.
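The two LASER+LR settings differ only in which embeddings feed the classifier. Here is a runnable sketch of that pipeline using scikit-learn, with random vectors standing in for real 1024-dimensional LASER sentence embeddings (the LASER toolkit itself is not invoked, and the data below is synthetic, not from the actual datasets):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

DIM = 1024  # LASER sentence embeddings are 1024-dimensional
rng = np.random.default_rng(0)

# Synthetic stand-ins for target-language embeddings with binary labels (1 = hateful).
X_target = rng.normal(size=(200, DIM))
y_target = rng.integers(0, 2, size=200)

# Baseline setting: train and test on the same (target) language.
clf = LogisticRegression(max_iter=1000).fit(X_target[:150], y_target[:150])
baseline_preds = clf.predict(X_target[150:])

# All_but_one setting: also stack embeddings from other languages into the training set;
# LASER's shared embedding space is what makes this mixing meaningful.
X_other = rng.normal(size=(400, DIM))
y_other = rng.integers(0, 2, size=400)
X_train = np.vstack([X_target[:150], X_other])
y_train = np.concatenate([y_target[:150], y_other])
clf_all = LogisticRegression(max_iter=1000).fit(X_train, y_train)
all_but_one_preds = clf_all.predict(X_target[150:])
```

With real LASER embeddings, the same few lines apply unchanged; only the loading of `X_*`/`y_*` differs.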

Blogs and GitHub repos we used for reference 👼

  1. MUSE embeddings are downloaded and extracted using the code from the MUSE GitHub repository
  2. For fine-tuning BERT, this blog by Chris McCormick was used, and we also referred to the Transformers GitHub repo
  3. For the CNN-GRU model, we used the original repo for reference
  4. For generating the LASER embeddings of the dataset, we used the code from the LASER GitHub repository

For more details about our paper

Sai Saketh Aluru, Binny Mathew, Punyajoy Saha and Animesh Mukherjee. 2020. "Deep Learning Models for Multilingual Hate Speech Detection". ECML-PKDD

Todos

  • Upload our models to transformers community to make them public
  • Add arxiv paper link and description
  • Create an interface for social scientists where they can use our models easily with their data
  • Create a pull request to add the models to official transformers repo
๐Ÿ‘ The repo is still in active developements. Feel free to create an issue !! ๐Ÿ‘

License: MIT License

