Anna Marshalova, Olga Tikhobaeva, Timur Ionov
- Python 3.10+
- GPU with 26+ GB of memory (we used a single A100-SXM4-40GB)
- W&B account
- Kaggle account
- Accept the rules of the Kaggle competition to access the dataset: https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data
- Download the data: `kaggle competitions download -c pii-detection-removal-from-educational-data`
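The downloaded archive contains the training set as JSON, where each record pairs a document's tokens with BIO labels. A minimal sketch of working with one record (field names follow the competition's data description; the sample record itself is invented):

```python
import json

# Invented sample mirroring the competition's train.json schema
# (field names assumed from the dataset description).
sample = json.loads("""
{
  "document": 7,
  "full_text": "My name is Ann Smith.",
  "tokens": ["My", "name", "is", "Ann", "Smith", "."],
  "trailing_whitespace": [true, true, true, true, false, false],
  "labels": ["O", "O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O"]
}
""")

def pii_spans(record):
    """Pair each token with its BIO label and keep only PII tokens."""
    return [(tok, lab)
            for tok, lab in zip(record["tokens"], record["labels"])
            if lab != "O"]

print(pii_spans(sample))
# [('Ann', 'B-NAME_STUDENT'), ('Smith', 'I-NAME_STUDENT')]
```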
- Clone the repo: `git clone https://github.com/sir-timio/NLP_ods_course`
- Install dependencies: `pip install -r requirements.txt`
Project structure:

```
.
├── conf
│   ├── generation_conf.yaml
│   └── prompts
│       └── rewriting_prompt_v1.txt
├── data
│   ├── essay
│   │   ├── mixtral_train.json
│   │   ├── og_train_downsampled.json
│   │   ├── og_train.json
│   │   ├── og_val.json
│   │   └── orig_train.json
│   └── faker_pii.csv
├── pybooks
│   ├── dataset_logging.ipynb
│   ├── eda.ipynb
│   ├── fill_ner.ipynb
│   └── llm_rewriting.ipynb
├── README.md
├── src
│   ├── dataset
│   │   └── utils.py
│   ├── generation
│   │   ├── fill_ner.py
│   │   ├── llm_rewriting.py
│   │   ├── make_fake_pii.py
│   │   └── utils.py
│   ├── __init__.py
│   ├── metrics.py
│   ├── modeling
│   │   ├── deberta_base.py
│   │   ├── deberta_focal.py
│   │   ├── __init__.py
```
- Run the fake PII generator: `python src/generation/make_fake_pii.py`
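The generator script itself is not shown here; judging by `data/faker_pii.csv`, it produces a CSV of fake identifiers, possibly via the Faker library. A stdlib-only sketch of the same idea (the column names and name lists are invented for illustration):

```python
import csv
import io
import random

random.seed(0)  # reproducible fake data

# Invented name pools; a real generator would draw from a much larger source.
FIRST = ["Anna", "Olga", "Timur", "Ivan"]
LAST = ["Marshalova", "Tikhobaeva", "Ionov", "Petrov"]

def fake_record():
    """Produce one row of fake PII (hypothetical column names)."""
    first, last = random.choice(FIRST), random.choice(LAST)
    return {
        "NAME_STUDENT": f"{first} {last}",
        "EMAIL": f"{first.lower()}.{last.lower()}@example.com",
        "PHONE_NUM": "".join(random.choice("0123456789") for _ in range(10)),
    }

# Write a small CSV in the spirit of data/faker_pii.csv
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["NAME_STUDENT", "EMAIL", "PHONE_NUM"])
writer.writeheader()
for _ in range(3):
    writer.writerow(fake_record())
print(buf.getvalue())
```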
- Rewrite essays with an LLM to insert PII placeholders: run `pybooks/llm_rewriting.ipynb` or `python src/generation/llm_rewriting.py`
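This step prompts an LLM (the `mixtral_train.json` file suggests Mixtral) with the template in `conf/prompts/rewriting_prompt_v1.txt`. The actual template is not reproduced here; a hypothetical sketch of building such a prompt:

```python
# Assumed shape of the rewriting prompt: a template with an {essay} slot.
# The real conf/prompts/rewriting_prompt_v1.txt may read quite differently.
PROMPT_TEMPLATE = (
    "Rewrite the following student essay, inserting placeholder tags such as "
    "{{NAME_STUDENT}} and {{EMAIL}} where personal information would appear.\n\n"
    "Essay:\n{essay}\n\nRewritten essay:"
)

def build_prompt(essay: str) -> str:
    """Fill the rewriting template with one essay before sending it to the LLM."""
    return PROMPT_TEMPLATE.format(essay=essay)

print(build_prompt("I think reflection helped my learning."))
```

The doubled braces (`{{NAME_STUDENT}}`) survive `str.format` as literal `{NAME_STUDENT}` tags, so the LLM sees the placeholder syntax the pipeline fills in later.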
- Insert the fake PII into the generated essays: run `pybooks/fill_ner.ipynb` or `python src/generation/fill_ner.py`
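Conceptually, this step replaces placeholder tags like `{NAME_STUDENT}` in the rewritten essays with the fake values and emits aligned BIO labels. A simplified, whitespace-tokenized sketch (the repo's `fill_ner.py` may handle tokenization and punctuation differently):

```python
import re

def fill_placeholders(text, pii):
    """Replace {TYPE} placeholders with fake values and record a BIO label
    for every token (naive whitespace tokenization for simplicity)."""
    tokens, labels = [], []
    for piece in text.split():
        m = re.fullmatch(r"\{(\w+)\}(\W*)", piece)
        if m and m.group(1) in pii:
            words = pii[m.group(1)].split()
            for i, w in enumerate(words):
                # reattach trailing punctuation to the last inserted word
                tail = m.group(2) if i == len(words) - 1 else ""
                tokens.append(w + tail)
                labels.append(("B-" if i == 0 else "I-") + m.group(1))
        else:
            tokens.append(piece)
            labels.append("O")
    return tokens, labels

toks, labs = fill_placeholders(
    "My name is {NAME_STUDENT}.", {"NAME_STUDENT": "Ann Smith"}
)
print(list(zip(toks, labs)))
# [('My', 'O'), ('name', 'O'), ('is', 'O'),
#  ('Ann', 'B-NAME_STUDENT'), ('Smith.', 'I-NAME_STUDENT')]
```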
- Configure and train the model: `python train.py`
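For evaluation, `src/metrics.py` presumably tracks the competition metric, micro F-beta with beta = 5, which weights recall far more heavily than precision. A minimal sketch of that score over entity counts:

```python
def f_beta(tp, fp, fn, beta=5.0):
    """Micro F_beta over predicted vs. true PII entities; the competition
    leaderboard uses beta = 5 (recall weighted heavily over precision)."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

# Toy example: 8 correct entities, 2 spurious, 1 missed
print(round(f_beta(tp=8, fp=2, fn=1), 4))
# 0.8851
```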
All logged metrics are displayed in the W&B report.