mrtovsky / thc

Classify harmful speech in Polish tweets

THC

Tweet Harmfulness Classification

This repository provides a solution to Task 6-2: Type of harmfulness of the PolEval 2019 challenge. The project was guided by one simple mission:

To create a world, where haters ain't gonna hate.

(0) RT @anonymized_account @anonymized_account wszystkiego co najlepsze i najpiękniejsze! 🎉
(1) @anonymized_account A ja bym to tak ujął: Kto krzyżem wojuje, na krzyżu ginie😁
(2) @anonymized_account @anonymized_account @anonymized_account Sakiewicz, Tobie wazelina oczy zalewa i bredzisz.

Project Organisation

├── README.md          <- The top-level README for developers using this project.
│
├── data
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── logs               <- Tensorboard model training logs.
│
├── models             <- Trained and serialized models, model predictions, or model summaries.
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering) and
│                         a short `-` delimited description, e.g. `00-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── poetry.lock        <- File to resolve and install all dependencies listed in the
│                         pyproject.toml file.
├── pyproject.toml     <- File orchestrating the project and its dependencies.
│
└── thc                <- Source code for use in this project.

Notebooks

The modeling steps are split across separate notebooks:

  • 00-texts-integrity explores the data and examines the class imbalance of the dataset. It also generates a presentation of an example input.
  • 01-train-valid-split divides the data set into representative training and validation sets to avoid the consequences of sampling bias, such as those seen in the well-known Literary Digest presidential poll.
  • 10-distilbert sets up the DistilBERT experiments. The Multilingual Cased DistilBERT model was fine-tuned on the downstream task using the AdamW optimizer.
  • 11-model-selection shows how the best model is selected using the TensorBoard training logs and prepares the test dataset predictions.
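
The idea of a representative train/validation split mentioned for 01-train-valid-split can be illustrated with a minimal, standard-library sketch of stratified sampling. The notebook's actual implementation is not shown in this README, so the function name stratified_split, the 20% validation fraction, the seed, and the toy label counts are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, valid_fraction=0.2, seed=42):
    """Return (train_indices, valid_indices) that roughly preserve class proportions."""
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, valid = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        if len(indices) > 1:
            # Keep at least one example of every class in the validation set.
            n_valid = max(1, round(len(indices) * valid_fraction))
        else:
            # A class with a single example stays entirely in the training set.
            n_valid = 0
        valid.extend(indices[:n_valid])
        train.extend(indices[n_valid:])
    return sorted(train), sorted(valid)


# Hypothetical, heavily imbalanced labels (not the PolEval data).
labels = [0] * 90 + [1] * 4 + [2] * 6
train_idx, valid_idx = stratified_split(labels, valid_fraction=0.2)
```

Splitting per class rather than over the whole data set keeps rare classes from disappearing from the validation set by chance, which is the sampling-bias risk the notebook description refers to.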

Installation

If you only need the thc source package, it is enough to run:

pip install git+https://github.com/mrtovsky/thc.git

To interact with the notebooks, e.g. to rerun them, the full project setup is necessary. It takes a few steps. First, clone the repository:

git clone https://github.com/mrtovsky/thc.git

Then, enter the cloned directory and create a .env file that stores an environment variable with the path to the cloned repository:

cd thc/
touch .env
printf 'REPOSITORY_PATH="%s"\n' "$(pwd)" >> .env
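
How the notebooks consume the .env file is not described in this README. As a rough sketch, assuming the file contains a single line of the form REPOSITORY_PATH="...", it could be read with the standard library alone. The helper names read_env_file and repository_path are hypothetical and are not part of the thc package.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, stripping optional surrounding double quotes."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def repository_path(env_file=".env"):
    """Prefer an exported REPOSITORY_PATH, then fall back to the .env file."""
    from_environment = os.environ.get("REPOSITORY_PATH")
    if from_environment:
        return Path(from_environment)
    return Path(read_env_file(env_file)["REPOSITORY_PATH"])
```

A notebook could then build paths such as repository_path() / "data" / "raw" without hard-coding the location of the clone.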

Poetry

The recommended way of installing the full project is via Poetry. If Poetry is not installed yet, follow the installation instructions in the Poetry documentation. Then, assuming you have already entered the thc directory, resolve and install the dependencies using:

poetry install

You may also want to register the newly created virtual environment as a Jupyter Notebook kernel. This can be done by calling:

poetry run python -m ipykernel install --name=thc-venv

This will make thc-venv available in your Jupyter Notebook kernels.

pip

It is also possible to install the package in the traditional way. Simply run:

pip install -e .

This installs the package in editable mode. If you installed it inside a virtual environment, registering it as a Jupyter Notebook kernel works the same way as with Poetry, except that the command drops the leading poetry run prefix (remember to activate the virtualenv beforehand):

python -m ipykernel install --name=thc-venv

Results

Dataset   Micro-F1      Macro-F1
TRAIN     PLACEHOLDER   PLACEHOLDER
VALID     PLACEHOLDER   PLACEHOLDER
TEST      PLACEHOLDER   PLACEHOLDER
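
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how Micro-F1 and Macro-F1 can be computed for single-label, multi-class predictions. It averages over every class present in the inputs. The official PolEval evaluation script may treat classes differently, so this should be read as an illustration of the definitions, not as the scoring procedure used here. The function name f1_scores and the toy labels are assumptions.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    if not classes:
        return 0.0, 0.0

    total_tp = total_fp = total_fn = 0
    per_class_f1 = []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)

        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denominator = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denominator if denominator else 0.0)

        total_tp += tp
        total_fp += fp
        total_fn += fn

    # Micro-F1 pools the counts over classes; Macro-F1 averages per-class F1.
    micro_denominator = 2 * total_tp + total_fp + total_fn
    micro = 2 * total_tp / micro_denominator if micro_denominator else 0.0
    macro = sum(per_class_f1) / len(per_class_f1)
    return micro, macro


# Made-up labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 0]
micro_f1, macro_f1 = f1_scores(y_true, y_pred)
```

On this toy input the micro average equals plain accuracy (5 of 8 correct), while the macro average weights the three classes equally and therefore reacts more strongly to mistakes on the rare classes.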

More detailed training results can be inspected in TensorBoard:

tensorboard --logdir ./logs/ --host localhost
