pseudonymization named-entity-recognition

Pseudo App

This demo app is part of the document pseudonymization effort led by Etalab's Lab IA. Other Lab IA projects can be found at the Lab IA.

Project Status: [Active]

Intro/Objectives

The purpose of this repo is to provide a quick demo of the pseudonymization tool we developed. The larger goal of the pseudonymization project is to help France's Conseil d'État open its court decisions to the general public, as required by law. More info about pseudonymization and this project can be found in our French pseudonymization guide here. Behind this website, there is an API that does the actual text tagging and pseudonymization.

Methods Used

Natural Language Processing: Information Extraction: Named Entity Recognition
Natural Language Processing: Language Modelling / Feature Learning: Word embeddings
Machine Learning: Deep Learning: Recurrent Networks: BiLSTM+CRF
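
To make the role of named entity recognition more concrete, here is a minimal sketch of how tagged spans could be turned into pseudonymized text. It does not use Flair or the project's model: the spans are written by hand, and the function name `pseudonymize`, the `(start, end, label)` span format, and the `<LABEL>` placeholder style are all assumptions made for illustration, not the API's actual behavior.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label) character offsets,
# similar to what a NER tagger could produce.
Span = Tuple[int, int, str]


def pseudonymize(text: str, spans: List[Span]) -> str:
    """Replace each tagged span with a placeholder such as <PER>."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(f"<{label}>")        # placeholder instead of the entity
        cursor = end
    pieces.append(text[cursor:])           # remaining text after the last entity
    return "".join(pieces)


example = "Marie Dupont lives in Lyon."
tagged = [(0, 12, "PER"), (22, 26, "LOC")]
print(pseudonymize(example, tagged))  # prints "<PER> lives in <LOC>."
```

In a real pipeline the spans would come from the trained tagger rather than being hard-coded.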

Technologies

Python
Flair, sacremoses
Dash
SQLite
Pandas

Demo Description

The demo consists of four tabs:

Introduction of the project: a brief overview of our pseudonymization project,
Upload of a document to be pseudonymized: lets you upload an image-free .doc, .docx, or .txt file (up to 100 kB),
Comparison of volume of training data vs annotation performance: we try to answer the question "How much data do I need to get decent results?",
API Stats: the usage statistics of the API that actually does the work.
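
The upload constraints listed above (file type and size) can be sketched as a small check. This is only an illustration: the function name `is_upload_allowed` is hypothetical, the limit is assumed to mean 100,000 bytes, and the sketch does not verify that a document is free of images.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".doc", ".docx", ".txt"}
MAX_UPLOAD_BYTES = 100 * 1000  # assumption: "100 kB" means 100,000 bytes


def is_upload_allowed(filename: str, size_bytes: int) -> bool:
    """Return True if the file has an accepted extension and size."""
    extension = Path(filename).suffix.lower()  # case-insensitive comparison
    return extension in ALLOWED_EXTENSIONS and 0 < size_bytes <= MAX_UPLOAD_BYTES


print(is_upload_allowed("decision.docx", 20_000))
print(is_upload_allowed("decision.pdf", 20_000))
```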

This demo depends by default on the pseudo API. The API is automatically pulled from its repo in the docker-compose file.

You need to train your own NER model with the Flair library. Unfortunately, we can share neither the model nor the data it was trained on, as they contain non-public information.

Getting Started

The easiest way to run this application is by using Docker and Docker Compose.

Clone this repo (for help see this tutorial).
Create a .env file in the repo folder and set three variables in it: the path of the local model (PSEUDO_MODEL_PATH), the path of the API database (PSEUDO_API_DB_PATH), and the URL of the API (PSEUDO_REST_API_URL). Note that you could also pass this env var to the app directly, in which case you would not need to run the API.
Launch the wrapper bash file run_docker.sh. This file will clean and rebuild the required Docker containers by calling docker-compose.yml.
Go to localhost/pseudo/
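
Before launching the containers, it can help to check that the .env file defines the three variables named in the steps above. The sketch below is a simple stand-alone checker, not part of the repo: the helper names `parse_env_file` and `missing_variables` are hypothetical, the parser only handles plain `KEY=value` lines, and the sample paths are placeholders to replace with your own.

```python
from typing import Dict, List, Mapping

REQUIRED_VARIABLES = ("PSEUDO_MODEL_PATH", "PSEUDO_API_DB_PATH", "PSEUDO_REST_API_URL")


def parse_env_file(content: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in content.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_variables(values: Mapping[str, str]) -> List[str]:
    """List the required variables that are absent or empty."""
    return [name for name in REQUIRED_VARIABLES if not values.get(name)]


# Placeholder values for illustration only.
sample = """
# Hypothetical paths, replace with your own
PSEUDO_MODEL_PATH=/path/to/model.pt
PSEUDO_API_DB_PATH=/path/to/api_stats.sqlite
"""
print(missing_variables(parse_env_file(sample)))  # prints ['PSEUDO_REST_API_URL']
```

To check a real file, pass its contents instead of `sample`, for example `parse_env_file(open(".env").read())`.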

Project Deliverables

Contact

Feel free to contact @pedevineau or @psorianom or other Lab IA team members with any questions or if you are interested in contributing!

About

Etalab's Lab IA Pseudonymization Demo source code

MIT License

Languages

CSS 58.3%, Python 40.6%, Dockerfile 0.9%, Shell 0.3%