jandorn / hate-speech-classification

Hate Speech Classification / Twitter Sentiment Classification

We developed a hate speech classification pipeline for tweets. It covers preprocessing, dataset splitting, and three classifiers: Naive Bayes, a Support Vector Machine (SVM) and a Neural Network. Hyperparameter tuning was used to maximize the precision of each classifier. The best and final model was the SVM.

For the classification, we used a Python notebook (.ipynb). This gave us quick and easy access to various Python libraries while keeping a clear structure. The format is also well suited to working as a group via Google Colab.

Structure

Jupyter Notebook

  1. Imports & read data: Python libraries are imported and the training data is uploaded
  2. Basic inspection of dataset
  3. Preprocessing:
    1. The placeholder tokens "@USER", "RT" and "{{URL}}" are removed
    2. Stemming and Lemmatization
    3. Everything in lowercase
  4. Train/test split: Division into a training set (80%) and a test set (20%)
  5. Naïve Bayes
    1. Train model incl. hyperparameter tuning
    2. Evaluation (classification_report, ConfusionMatrixDisplay.from_estimator)
    3. Apply to test set
  6. SVM
    1. Train model incl. hyperparameter tuning
    2. Evaluation (classification_report, ConfusionMatrixDisplay.from_estimator)
    3. Apply to test set
  7. Neural Network
    1. Train model incl. hyperparameter tuning
    2. Evaluation (classification_report, ConfusionMatrixDisplay.from_estimator)
    3. Apply to test set
  8. Export results: Export of the classification result with the highest accuracy (in our case, the SVM)
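
The token removal and lowercasing in step 3 can be sketched with the standard library alone. This is a minimal illustration, not the notebook's code: the function name `clean_tweet` and the exact regular expressions are assumptions, and stemming and lemmatization are left out because they need an NLP library that this README does not name.

```python
import re

# Placeholder tokens named in step 3.1. The exact patterns are assumptions;
# the notebook may match them differently.
_PLACEHOLDERS = re.compile(r"@USER|\{\{URL\}\}|\bRT\b")
_WHITESPACE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Remove @USER, RT and {{URL}} tokens, collapse whitespace, lowercase."""
    without_tokens = _PLACEHOLDERS.sub(" ", text)
    collapsed = _WHITESPACE.sub(" ", without_tokens).strip()
    return collapsed.lower()


if __name__ == "__main__":
    print(clean_tweet("RT @USER: Check THIS out {{URL}}"))
```

The tokens are removed before lowercasing so that the case-sensitive patterns still match. Punctuation is kept as-is in this sketch.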

Dataset

train.tsv

The train.tsv file is the dataset we used for training. It contains over 18,000 labeled tweets in tab-separated (TSV) format.
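
A tab-separated file like this can be read with Python's `csv` module. The sketch below is hedged: the column order (text first, label second), the absence of a header row, and the helper name `read_labeled_tsv` are all assumptions, since the file layout is not described here. The labels "0" and "1" in the demo are placeholders as well.

```python
import csv
import io
from typing import Iterable, List, Tuple


def read_labeled_tsv(
    lines: Iterable[str],
    text_col: int = 0,       # assumed position of the tweet text
    label_col: int = 1,      # assumed position of the label
    has_header: bool = False,
) -> Tuple[List[str], List[str]]:
    """Split tab-separated rows into parallel lists of texts and labels."""
    # QUOTE_NONE keeps quotation marks inside tweets as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)
    texts: List[str] = []
    labels: List[str] = []
    for row in reader:
        if len(row) <= max(text_col, label_col):
            continue  # skip blank or malformed rows
        texts.append(row[text_col])
        labels.append(row[label_col])
    return texts, labels


if __name__ == "__main__":
    # In-memory stand-in for the real file; the rows are made up.
    sample = io.StringIO('@USER have a "great" day\t0\nsome insulting tweet\t1\n')
    texts, labels = read_labeled_tsv(sample)
    print(len(texts), labels)
```

For the real file, an open file handle can be passed instead of the `StringIO` object, for example `open("train.tsv", newline="", encoding="utf-8")`.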

test.tsv.dist

The test.tsv.dist file is our evaluation (testing) set. It contains almost 5,000 unlabeled tweets whose labels are to be predicted by our trained models. Unfortunately, we don't have access to the corresponding dataset with the actual labels. We do know that our SVM labeled about 94% of them correctly.
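
Because the evaluation tweets are unlabeled, tuning has to rely on the labeled data, and the fitted model is then applied to the unlabeled texts. The following scikit-learn sketch shows that flow for an SVM. It is an illustration under assumptions rather than the notebook's implementation: the TF-IDF features, the parameter grid, the `precision` scoring string, the helper names `tune_svm` and `make_toy_data`, and the toy sentences are all hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def tune_svm(texts, labels):
    """Grid-search a TF-IDF + SVM pipeline and return the fitted search object."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),  # assumed feature extraction
        ("clf", SVC()),
    ])
    # Hypothetical grid; the values tried in the notebook are not documented here.
    param_grid = {"clf__C": [0.1, 1.0, 10.0], "clf__kernel": ["linear", "rbf"]}
    search = GridSearchCV(pipeline, param_grid, scoring="precision", cv=2)
    search.fit(texts, labels)
    return search


def make_toy_data():
    """Build 20 made-up sentences (10 per class) standing in for train.tsv."""
    nice = ["great", "lovely", "kind", "wonderful", "nice"]
    rude = ["awful", "stupid", "horrible", "nasty", "dumb"]
    texts, labels = [], []
    for label, words in ((0, nice), (1, rude)):
        for word in words:
            texts.append(f"you are {word} today")
            texts.append(f"what a {word} day")
            labels.extend([label, label])
    return texts, labels


if __name__ == "__main__":
    texts, labels = make_toy_data()
    # 80/20 split of the labeled data, as in step 4 of the notebook outline.
    X_train, X_val, y_train, y_val = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=0
    )
    search = tune_svm(X_train, y_train)
    print(search.best_params_)
    print(classification_report(y_val, search.predict(X_val), zero_division=0))
    # Stand-in for the unlabeled tweets in test.tsv.dist.
    unlabeled = ["you are horrible", "what a lovely day"]
    print(search.predict(unlabeled))
```

`GridSearchCV` refits the best configuration on all data passed to `fit`, so the returned object can be used directly for prediction.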

Authors

This group project was done by Jan Bode and Jan Dorn in January 2022, as undergraduate students at the Karlsruhe Institute of Technology (KIT), as a bonus for the lecture 'Introduction to Artificial Intelligence'. It is released under the standard MIT License.
