This repository provides a solution to Task 6-2: Type of harmfulness of the PolEval 2019 challenge. The project was guided by one simple mission:
To create a world, where haters ain't gonna hate.
(0) RT @anonymized_account @anonymized_account wszystkiego co najlepsze i najpiękniejsze!
(1) @anonymized_account A ja bym to tak ujął: Kto krzyżem wojuje, na krzyżu ginie
(2) @anonymized_account @anonymized_account @anonymized_account Sakiewicz, Tobie wazelina oczy zalewa i bredzisz.
├── README.md          <- The top-level README for developers using this project.
│
├── data
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── logs               <- TensorBoard model training logs.
│
├── models             <- Trained and serialized models, model predictions, or model summaries.
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering) and
│                         a short `-` delimited description, e.g. `00-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── poetry.lock        <- File to resolve and install all dependencies listed in the
│                         pyproject.toml file.
├── pyproject.toml     <- File orchestrating the project and its dependencies.
│
└── thc                <- Source code for use in this project.
The project is designed to separate the particular modeling steps into notebooks. Notebook list:
- 00-texts-integrity focuses on getting familiar with the data and examines dataset imbalance. It also generates a presentation of an example input.
- 01-train-valid-split divides the data set into properly representative training and validation sets to avoid the consequences of sampling bias, as famously illustrated by the 1936 Literary Digest presidential poll.
- 10-distilbert provides the DistilBERT experiment setup. The multilingual cased DistilBERT model was fine-tuned on the downstream task using the AdamW optimizer.
- 11-model-selection shows how the best model is selected using the TensorBoard training logs and prepares predictions for the test dataset.
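The idea behind a representative train/validation split can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper name `stratified_split`, the 20% validation fraction, and the toy label vector are all hypothetical, and the notebook itself may use a different method or library.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, valid_fraction=0.2, seed=42):
    """Return (train_indices, valid_indices) with similar class proportions.

    Hypothetical helper for illustration; not taken from the thc package.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, valid = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_valid = round(len(indices) * valid_fraction)
        # Keep at least one validation example for every class with 2+ samples.
        if len(indices) > 1:
            n_valid = max(1, n_valid)
        valid.extend(indices[:n_valid])
        train.extend(indices[n_valid:])
    return sorted(train), sorted(valid)


# Made-up, heavily imbalanced labels (not the real dataset).
labels = [0] * 90 + [1] * 6 + [2] * 4
train_idx, valid_idx = stratified_split(labels, valid_fraction=0.2)
print("train:", Counter(labels[i] for i in train_idx))
print("valid:", Counter(labels[i] for i in valid_idx))
```

Splitting each class separately keeps the rare classes present in both sets, which a purely random split over an imbalanced data set does not guarantee.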
If only the functionality of the thc source package is of interest, it is enough to run:
pip install git+https://github.com/mrtovsky/thc.git
To interact with the notebooks, e.g. to rerun them, full project preparation is necessary. It can be done in a few steps. First, clone the repository:
git clone https://github.com/mrtovsky/thc.git
Then, enter this directory and create a .env file that stores an environment variable with the cloned repository path:
cd thc/
touch .env
printf 'REPOSITORY_PATH="%s"\n' "$(pwd)" >> .env
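For illustration, the resulting .env file can be read back from Python with the standard library alone. This is a minimal sketch, not how the project necessarily loads it: the helper name `parse_env` is hypothetical, and the `data/raw` path is taken from the directory layout described earlier.

```python
import os
from pathlib import Path


def parse_env(text):
    """Parse simple KEY="value" lines into a dict (hypothetical helper)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


env_file = Path(".env")
if env_file.exists():
    env = parse_env(env_file.read_text(encoding="utf-8"))
    # Fall back to the current directory if the variable is missing.
    repository_path = Path(env.get("REPOSITORY_PATH", os.getcwd()))
    print(repository_path / "data" / "raw")
```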
The recommended way of installing the full project is with Poetry. If Poetry is not installed yet, follow its official installation instructions. Then, assuming you have already entered the thc directory, resolve and install the dependencies using:
poetry install
Furthermore, you may want to attach a kernel with the already created virtual environment to Jupyter Notebook. This can be done by calling:
poetry run python -m ipykernel install --name=thc-venv
This will make thc-venv available in your Jupyter Notebook kernels.
It is also possible to install the package in the traditional way; simply run:
pip install -e .
This installs the package in editable mode. If you installed it inside a virtual environment, attaching it as a Jupyter Notebook kernel works the same as with Poetry, except that the command drops the leading `poetry run` (remember that the virtualenv needs to be activated beforehand):
python -m ipykernel install --name=thc-venv
| Dataset | Micro-F1 | Macro-F1 |
|---|---|---|
| TRAIN | PLACEHOLDER | PLACEHOLDER |
| VALID | PLACEHOLDER | PLACEHOLDER |
| TEST | PLACEHOLDER | PLACEHOLDER |
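As a reminder of what the two columns measure, the sketch below computes micro- and macro-averaged F1 for single-label multi-class predictions in plain Python. The function name `f1_scores` and the toy labels are hypothetical, and the official PolEval evaluation script may differ in details.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    tp_total = fp_total = fn_total = 0
    per_class_f1 = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        tp_total, fp_total, fn_total = tp_total + tp, fp_total + fp, fn_total + fn

    # Micro: pool the counts over all classes, then compute a single F1.
    micro_denom = 2 * tp_total + fp_total + fn_total
    micro = 2 * tp_total / micro_denom if micro_denom else 0.0
    # Macro: unweighted mean of per-class F1, so rare classes count equally.
    macro = sum(per_class_f1) / len(per_class_f1) if per_class_f1 else 0.0
    return micro, macro


# Made-up labels: the single class-1 example is misclassified as class 0.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 2]
micro, macro = f1_scores(y_true, y_pred)
print(f"micro-F1={micro:.3f} macro-F1={macro:.3f}")
```

On imbalanced data the macro average is pulled down by errors on rare classes, while the micro average is dominated by the majority class.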
More detailed training results can be viewed by launching TensorBoard:
tensorboard --logdir ./logs/ --host localhost