Personally Identifiable Information Data Detection. NLP Course Project

Anna Marshalova, Olga Tikhobaeva, Timur Ionov

Prerequisites

Python 3.10+
GPU with at least 26 GB of memory (we used one A100-SXM4-40GB)
W&B account
Kaggle account

Setup

Accept the rules of the Kaggle competition to get access to the dataset

https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data

Download the data

kaggle competitions download -c pii-detection-removal-from-educational-data

Clone repo

git clone https://github.com/sir-timio/NLP_ods_course

Install dependencies

pip install -r requirements.txt

Repo structure

.
├── conf
│   ├── generation_conf.yaml
│   └── prompts
│       └── rewriting_prompt_v1.txt
├── data
│   ├── essay
│   │   ├── mixtral_train.json
│   │   ├── og_train_downsampled.json
│   │   ├── og_train.json
│   │   ├── og_val.json
│   │   └── orig_train.json
│   └── faker_pii.csv
├── pybooks
│   ├── dataset_logging.ipynb
│   ├── eda.ipynb
│   ├── fill_ner.ipynb
│   └── llm_rewriting.ipynb
├── README.md
├── src
│   ├── dataset
│   │   └── utils.py
│   ├── generation
│   │   ├── fill_ner.py
│   │   ├── llm_rewriting.py
│   │   ├── make_fake_pii.py
│   │   └── utils.py
│   ├── __init__.py
│   ├── metrics.py
│   ├── modeling
│   │   ├── deberta_base.py
│   │   ├── deberta_focal.py
│   │   ├── __init__.py

Usage

Run fake PII generator

python src/generation/make_fake_pii.py
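
The general idea of generating a table of fake PII can be sketched as below. This is a minimal illustration using only the standard library, not the code in make_fake_pii.py. The value pools, the function names `make_fake_pii` and `to_csv`, and the column names are assumptions made for the example (the columns borrow the competition's label names); the actual layout of data/faker_pii.csv may differ.

```python
import csv
import io
import random

# Hypothetical value pools; a real generator would draw from far larger ones.
FIRST_NAMES = ["Maria", "John", "Aisha", "Wei", "Carlos"]
LAST_NAMES = ["Smith", "Garcia", "Ivanov", "Chen", "Okafor"]
DOMAINS = ["example.com", "example.org"]


def make_fake_pii(n_rows, seed=0):
    """Return n_rows dicts with one fake value per PII type."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    rows = []
    for _ in range(n_rows):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        username = f"{first}.{last}".lower()
        rows.append({
            "NAME_STUDENT": f"{first} {last}",
            "EMAIL": f"{username}@{rng.choice(DOMAINS)}",
            "USERNAME": f"{username}{rng.randint(1, 99)}",
            "ID_NUM": str(rng.randint(10**8, 10**9 - 1)),
            "PHONE_NUM": f"555-{rng.randint(0, 9999):04d}",
        })
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `to_csv(make_fake_pii(1000))` returns the CSV text, which can then be written to a file of your choice.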

Run LLM text rewriting with PII

pybooks/llm_rewriting.ipynb (notebook)
or
src/generation/llm_rewriting.py (script)

Insert fake data into the generated essays

pybooks/fill_ner.ipynb (notebook)
or
src/generation/fill_ner.py (script)
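
The filling step can be pictured roughly as follows: placeholders in a generated essay are replaced with fake values, and BIO labels are produced for the inserted tokens at the same time. This is a sketch under assumptions, not the logic of fill_ner.py. The `{LABEL}` placeholder syntax, the function name `fill_placeholders`, and the whitespace tokenization are all hypothetical; only the label names (NAME_STUDENT, EMAIL, and so on) come from the competition.

```python
import re

# Hypothetical placeholder syntax: {NAME_STUDENT}, {EMAIL}, ...
PLACEHOLDER = re.compile(r"\{([A-Z_]+)\}")


def fill_placeholders(template, fake_pii):
    """Replace {LABEL} placeholders; return (tokens, BIO labels).

    Raises KeyError if the template uses a label missing from fake_pii.
    """
    tokens, labels = [], []
    pos = 0
    for match in PLACEHOLDER.finditer(template):
        # Text before the placeholder carries no PII.
        for tok in template[pos:match.start()].split():
            tokens.append(tok)
            labels.append("O")
        label = match.group(1)
        # The first inserted token gets B-, the following ones I-.
        for i, tok in enumerate(fake_pii[label].split()):
            tokens.append(tok)
            labels.append(("B-" if i == 0 else "I-") + label)
        pos = match.end()
    # Remaining text after the last placeholder.
    for tok in template[pos:].split():
        tokens.append(tok)
        labels.append("O")
    return tokens, labels


tokens, labels = fill_placeholders(
    "My name is {NAME_STUDENT} and my email is {EMAIL} .",
    {"NAME_STUDENT": "Jane Doe", "EMAIL": "jane.doe@example.com"},
)
```

Generating the labels while inserting the values avoids a second pass that would have to locate the PII spans in the filled text.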

Configure and fit the model

python train.py

Results

All logged metrics are displayed in the W&B report.
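
For orientation, the Kaggle competition is scored with a micro-averaged F-beta with beta = 5, which weights recall much more heavily than precision. The sketch below shows one way to compute it over token-level labels; it is an illustration, and whether src/metrics.py computes it the same way is an assumption. The function name `micro_fbeta` is hypothetical.

```python
def micro_fbeta(true_labels, pred_labels, beta=5.0):
    """Micro F-beta over token labels, treating "O" as the negative class."""
    tp = fp = fn = 0
    for true, pred in zip(true_labels, pred_labels):
        if pred != "O" and pred == true:
            tp += 1
        else:
            # A wrong non-"O" prediction on a PII token counts as both
            # a false positive and a false negative.
            if pred != "O":
                fp += 1
            if true != "O":
                fn += 1
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


true = ["O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O", "B-EMAIL"]
pred = ["O", "B-NAME_STUDENT", "O", "B-EMAIL", "B-EMAIL"]
score = micro_fbeta(true, pred)  # tp=2, fp=1, fn=1
```

With beta = 5 a missed PII token costs 25 times as much as a spurious one in the denominator, so a model tuned for this metric should favor recall.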
