huu4ontocord / pii_processing

PII Processing code to clean up BigScience datasets. Reference implementation for the PII Hackathon

Personally Identifiable Information Processing

This is code for a multilingual Named Entity Recognition (NER) and PII processor used to remediate PII in web-scale datasets for training large language models. This code is not meant to be used for general-purpose PII remediation.

Organization of Repo

  • The repo is a clone of the neuralcoref repo, which makes it easier to modify the base neuralcoref code and to reuse its organization
  • The code in the ontology folder is for building the ontology and tokenizing text for the words in it
  • The code in the pii folder performs the NER and coreference resolution needed for PII processing
  • The code in the masakhane-ner folder is for training transformer-based NER models (BERT, RoBERTa, etc.)
  • The data folder contains the data needed by the other modules to operate
  • The contrib folder will contain community code drawn from the PII Hackathon
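
To make the remediation step concrete, here is a minimal sketch of masking PII once an NER step has produced entity spans. It is an illustration only and is not the code in the pii folder: the function name mask_spans, the (start, end, label) span format, and the bracketed placeholder style are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label), character offsets, end-exclusive.
Span = Tuple[int, int, str]


def mask_spans(text: str, spans: List[Span]) -> str:
    """Replace each labelled span with a placeholder such as [PERSON]."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Skip a span that overlaps one already masked.
            continue
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"[{label}]")        # placeholder for the entity
        cursor = end
    pieces.append(text[cursor:])           # untouched text after the last span
    return "".join(pieces)


text = "Contact Jane Doe at jane@example.com."
spans = [(8, 16, "PERSON"), (20, 36, "EMAIL")]
print(mask_spans(text, spans))  # Contact [PERSON] at [EMAIL].
```

In practice the spans would come from an NER model, and the placeholder could be swapped for a generated surrogate value instead of a label.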

PII Hackathon

This repo is also home to the reference implementation for the PII Hackathon, run by Ontocord, AISC, and BigScience.

  • The code under the directory ontology will be used for Module 1.
  • The code under the directory pii will be used for Module 2.
  • The code under the directory masakhane-ner will be used for Module 3.
  • TODO: Module 4. We will provide a reference implementation for ensemble semi-supervised training of a transformer model.

Requirements and Installation

  • pip install spacy==2.1.8
  • git clone https://github.com/ontocord/pii_processing
  • cd pii_processing/
  • python setup.py install
  • python -m nltk.downloader punkt stopwords wordnet
  • python -m spacy download en_core_web_lg
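
After running the steps above, a short script can confirm that the packages they install are importable. This is a sketch, not part of the repo: the function name check_install is hypothetical, and it assumes only that the spaCy model en_core_web_lg is installed as a regular Python package, which is how spacy download installs it.

```python
import importlib

# Packages installed by the steps above; the spaCy pin comes from the list.
REQUIRED = ["spacy", "nltk", "en_core_web_lg"]
PINNED_SPACY = "2.1.8"


def check_install(packages=REQUIRED):
    """Return {package: version string, or None if it cannot be imported}."""
    report = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            # Not every module exposes __version__, so fall back to "unknown".
            report[name] = str(getattr(module, "__version__", "unknown"))
    return report


report = check_install()
for name, version in report.items():
    print(f"{name}: {version if version is not None else 'MISSING'}")

if report["spacy"] not in (None, PINNED_SPACY):
    print(f"warning: spacy {report['spacy']} found, expected {PINNED_SPACY}")
```

A MISSING entry means the corresponding install step did not complete in the active Python environment.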

Usage

TODO

Credits

This code is based on original code by Ontocord, LLC (https://github.com/ontocord), Hugging Face's NeuralCoref (https://github.com/huggingface/neuralcoref) and MasakhaNER (https://github.com/masakhane-io/masakhane-ner), which is in turn based on HF Transformers (https://github.com/huggingface/transformers/) and Faker (https://github.com/joke2k/faker), and includes inspiration from Presidio (https://github.com/microsoft/presidio).

All code is released under Apache 2.0, except NeuralCoref, which is under the MIT License.

Data is licensed as specified in the various data folders.
