aref-hasan / nlp_project

An NLP project by students at the Baden-Württemberg Cooperative State University (DHBW) Mannheim

nlp_project

An NLP project by students at the Baden-Württemberg Cooperative State University (DHBW) Mannheim, Germany.

Personally Identifiable Information (PII) detection in text input using the following dataset: https://huggingface.co/datasets/ai4privacy/pii-masking-200k.

The goal is to detect if a given text contains private information such as names, addresses, phone numbers, passwords, banking information, etc. Users will be able to enter text into an application and receive feedback about the PII in that text.
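As a rough illustration of the kind of feedback described above, the sketch below flags e-mail addresses and phone numbers with regular expressions. This is not the approach used in this project, which evaluates trained models in the playground notebooks; the pattern set, the labels, and the function name find_pii are assumptions made for this example only.

```python
import re

# Hypothetical, deliberately simple patterns for illustration.
# Real PII detection (names, addresses, passwords, ...) needs a trained model.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def find_pii(text):
    """Return (label, matched_text, start, end) tuples, sorted by start offset."""
    findings = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((label, match.group(), match.start(), match.end()))
    return sorted(findings, key=lambda finding: finding[2])


# Example: report every finding for a piece of user input.
for label, value, start, end in find_pii("Mail jane.doe@example.com or call +49 621 1234567."):
    print(f"{label}: {value!r} at characters {start}-{end}")
```

A rule-based pass like this only catches rigidly formatted items, which is one reason model-based detection is worth comparing against.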

Team Members:

Franziska Marb, Jannik Völker, Nik Yakovlev, Aref Hasan.

Structure

App

A small demo application can be found in the app folder. That folder contains separate instructions for running and using it.

Playground

All our research and prototypes can be found in the playground folder. There is a separate Jupyter notebook for each of the models we tested for detecting PII. The results can be found in each notebook or in the paper.

Create venv:

python -m venv .venv

Activate venv (Windows; on Linux or macOS use source .venv/bin/activate):

.\.venv\Scripts\activate

Deactivate venv

deactivate

Write installed packages to requirements.txt

pip freeze > requirements.txt

Install packages

pip install -r requirements.txt

Download and prepare data

Most of our notebooks depend on training and evaluation data that must be downloaded from Hugging Face. The following script should do this. Make sure the venv is active, then run prepare_data.py:

.\.venv\Scripts\activate
python prepare_data.py
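Before opening a notebook, it can help to check that both steps worked. The sketch below is a hypothetical helper, not part of the repository: it assumes prepare_data.py writes data/dataset_english.json (the path read in the next section) and uses the standard sys.prefix comparison to detect an active venv.

```python
import sys
from pathlib import Path

# Assumption: this is the file that prepare_data.py produces.
DATA_FILE = Path("data/dataset_english.json")


def in_virtualenv():
    """True when the interpreter runs inside a venv (prefix differs from base prefix)."""
    return sys.prefix != sys.base_prefix


def preflight(data_file=DATA_FILE):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if not in_virtualenv():
        problems.append("virtual environment is not active")
    if not data_file.is_file():
        problems.append(f"{data_file} is missing; run prepare_data.py first")
    return problems
```

Calling preflight() in the first cell of a notebook and printing the returned list gives an early, readable warning instead of a file-not-found error further down.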

Use data

Read the dataset

import pandas as pd

df = pd.read_json("data/dataset_english.json")


License: MIT License

