maillouxc / wikimedia-data-exploration

Part of my FGCU Independent Study on NLP - Various LSTM/GRU/Fasttext Experiments with Keras on the English Wiki Personal Attacks Dataset

Wiki Personal Attacks Data Exploration

This repository contains code and basic documentation for an independent study course I am taking at FGCU with Dr. Koufakou in the field of natural language processing.

This repository includes the actual code used to run the various experiments. The data is NOT included in the repository, as it exceeds GitHub's file size limits.

However, the data can be obtained by contacting Dr. Koufakou at FGCU or myself. It is also available on the public internet, though I am not sure of the exact source.

The dataset used is a small subset of the Wikimedia personal attacks dataset that I obtained from Jason Scott, another student researching at FGCU on the same dataset with Dr. Koufakou.

Given more time, it would probably be worth using the full dataset, but that could be very time-consuming.

What is here

.idea/
    - This is the PyCharm project directory - it allows this project to be opened in PyCharm.
results/
    - This directory will be empty initially.
      This is where the results of each experiment will ultimately be output.
data/
    - This is where you put your data.
    - It also contains Jason's preprocessing script, which is only needed if you intend to preprocess the data yourself.
binary_f1/
    - Contains all of the experiments which output binary F1 scores.
    - All of these experiments can be run individually.
    - This currently includes the bidirectional LSTM/GRU models as well as the standard LSTM/GRU.
macro_f1/
    - Contains all of the experiments which output macro F1 scores.
    - All of these experiments can be run individually.
    - This also includes the Fasttext experiments, since they output macro F1 scores.
data_prep.py
    - This script extracts some commonly used functions to a utility file for reuse.
metrics.py
    - This is a wrapper class that handles gathering metrics for the macro F1 experiments.
requirements.txt
    - This file contains all of the dependencies that are needed by the project.
    - They can be installed directly by passing this file to pip (pip install -r requirements.txt).
    - It was autogenerated by the pip freeze command.
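The metrics.py wrapper above gathers macro F1 during the experiments. As a rough illustration of what macro-averaged F1 actually computes - the unweighted mean of per-class F1 scores - here is a pure-Python sketch (the macro_f1 function is hypothetical, not the repo's Keras-based wrapper):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then average the
    per-class scores with equal weight (unlike micro/binary F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # One-vs-rest counts for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```

Because each class counts equally, macro F1 rewards doing well on the rare "attack" class rather than just the majority class.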

How to use these experiments?

Ensure that the data directory contains the needed data in the proper format, then run whichever script you wish. The results will be output to files in the results directory of this project; the exact file depends on which experiment was run. Each script outputs accuracy, precision, recall, and either binary F1 or macro F1, depending on the specific script you run. The scripts are organized into folders according to which type of F1 score they report. The results are output in CSV format for easier analysis.
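Because the results are plain CSV, they are easy to post-process with the standard library. A minimal sketch, assuming hypothetical column names such as fold and macro_f1 (the real files may use different headers):

```python
import csv
import io

# Hypothetical results file - the actual column names may differ.
sample = """fold,accuracy,precision,recall,macro_f1
1,0.91,0.88,0.85,0.86
2,0.90,0.87,0.84,0.85
"""

def mean_metric(csv_text, column):
    """Average a numeric metric column across cross-validation folds."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sum(float(r[column]) for r in rows) / len(rows)
```

Pointing DictReader at an open results file instead of a string works the same way.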

The Fasttext classifier should be very fast, producing most results in under a minute on a mid-range system.

To use the Fasttext embeddings pre-trained on English Wikipedia, you will need to download them directly from the Fasttext website and point the script at them by changing the file path within the script.
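The pre-trained .vec files from the Fasttext website are plain text: a header line giving the vocabulary size and vector dimension, then one word per line followed by its vector components. A minimal loader sketch (load_vec_embeddings is a hypothetical helper, not part of this repo):

```python
import io

def load_vec_embeddings(file_obj, vocab=None):
    """Parse the fastText .vec text format: a 'count dim' header line,
    then 'word v1 v2 ... vdim' lines. Optionally keep only words in
    vocab to save memory - the full file is several GB."""
    header = file_obj.readline().split()
    dim = int(header[1])
    embeddings = {}
    for line in file_obj:
        parts = line.rstrip().split(" ")
        word = parts[0]
        if vocab is not None and word not in vocab:
            continue  # skip out-of-vocabulary words
        embeddings[word] = [float(x) for x in parts[1 : dim + 1]]
    return embeddings
```

In practice you would pass an open file handle for the downloaded .vec file and your tokenizer's vocabulary as vocab.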

These scripts have been cleaned up to be easy to understand, tweak, and run.

Additionally, if you have a GPU, performance can be improved by ensuring that you have the correct CUDA libraries properly installed, including libcublas and cuDNN. The internet has some decent guides available on how to do so. If these drivers are configured properly, the TensorFlow backend should automatically take advantage of your GPU. Depending on your specific system and how it is configured, you may already have the necessary drivers installed.

A guide on how to do so is located at https://bit.ly/2PAMH8Q

This is probably the most difficult part of running these experiments, but if you follow the instructions closely it should work. Be wary of installing too new a CUDA version; I ran into trouble with this myself. The most important thing is that your CUDA version is compatible with both your GPU and the specific version of TensorFlow you have installed.

My system had CUDA 10.1 and NVIDIA driver 418.39, but you may need something different depending on your exact GPU and driver versions.

Some of these experiments take quite some time to run depending on the specific hyperparameters chosen, but in general, reducing the pad length is the easiest way to make them run faster for testing purposes. On my GTX 1060 6GB, all of these experiments finished in less than 48 hours, averaging about 10 minutes per epoch on the standard LSTM/GRU models.
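The pad length matters for speed because it sets how many timesteps the LSTM/GRU has to unroll for every example. A sketch of the truncate-or-pad step (pad_sequence is a hypothetical helper; the scripts themselves presumably rely on Keras's pad_sequences):

```python
def pad_sequence(tokens, pad_length, pad_value=0):
    """Truncate or right-pad a token-id sequence to a fixed length,
    similar in spirit to Keras pad_sequences with padding='post'.
    Shorter pad_length = fewer recurrent timesteps = faster epochs."""
    return tokens[:pad_length] + [pad_value] * max(0, pad_length - len(tokens))
```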

All experiments within this repository are run with 5-fold cross-validation using stratified k-fold sampling.
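Stratified sampling keeps the class balance of the full dataset in every fold, which matters when the classes are imbalanced. A simplified pure-Python sketch of the idea (the experiments most likely use scikit-learn's StratifiedKFold; this helper is hypothetical):

```python
from collections import defaultdict

def stratified_kfold_indices(labels, n_splits=5):
    """Assign sample indices to n_splits folds so that each fold keeps
    roughly the same class proportions as the whole dataset."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal each class's samples round-robin across the folds.
        for i, idx in enumerate(indices):
            folds[i % n_splits].append(idx)
    return folds
```

Each fold then serves once as the test set while the rest are used for training, and the reported metrics are aggregated over the five runs.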

The experiments were run on a system running Ubuntu 18.04, but should run on most Linux distributions as long as the proper dependencies are installed.
