jchonc / deid

de-identification

An NLP-based PHI de-identification method

Vocabulary & Abbreviations

Core challenge for Lucas

We often need to test, experiment, and do research on free-text data from our clients, but first we have to erase all information that could be used to trace back to an individual. For example, consider the following text:

Mr. James Bond has visited us at 12/12/2018 at 3:00PM for this routine doctor's appointment. Dr. Ethan Hunt has noted his left hand has some rash.

Obviously, both names need to be erased so they do not reveal too much information. But there is more: all identifying data needs to be removed. According to HIPAA, we have to remove:

  • Names
  • Geographic subdivisions smaller than a state
  • All elements of dates (except year) related to an individual (including admission and discharge dates, birthdate, date of death, all ages over 89 years old, and elements of dates (including year) that are indicative of age)
  • Telephone, cellphone, and fax numbers
  • Email addresses
  • IP addresses
  • Social Security numbers
  • Medical record numbers
  • Health plan beneficiary numbers
  • Device identifiers and serial numbers
  • Certificate/license numbers
  • Account numbers
  • Vehicle identifiers and serial numbers including license plates
  • Website URLs
  • Full face photos and comparable images
  • Biometric identifiers (including finger and voice prints)
  • Any unique identifying numbers, characteristics or codes
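Several of the categories above (dates, phone numbers, email addresses, IP addresses, Social Security numbers, URLs) have a fairly regular surface form, so a first pass could be sketched with regular expressions. The following is a minimal sketch under stated assumptions, not this project's implementation: the `PATTERNS` table, the placeholder labels, and the function name `redact_patterns` are all hypothetical, and the patterns only cover common US-style formats.

```python
import re

# Hypothetical, illustrative patterns; real data needs much broader coverage.
# Order matters: more specific patterns run before more general ones.
PATTERNS = [
    ("URL", re.compile(r"https?://\S+")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(
        r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)")),
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
]


def redact_patterns(text):
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


sample = ("Call 555-123-4567 or mail bond@example.org, SSN 123-45-6789, "
          "from 192.168.0.1 on 12/12/2018.")
print(redact_patterns(sample))
```

This sketch replaces the whole date even though HIPAA allows the year to be kept, and it does nothing for names, addresses, or free-form identifiers, which need more than pattern matching.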

What we knew already

From the earlier project we have a limited way to parse and tag the various parts of a sentence.
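To make the idea of tagging concrete, here is a minimal sketch of tagging name spans in the example text. It is not the tagger from the earlier project: it assumes, purely for illustration, that a name is a run of capitalized words following an honorific such as "Mr." or "Dr.", and the names `HONORIFIC_NAME`, `tag_names`, and `redact_spans` are hypothetical.

```python
import re

# Assumption: a name is one or more capitalized words after an honorific.
HONORIFIC_NAME = re.compile(
    r"\b(Mr|Mrs|Ms|Dr|Prof)\.?\s+((?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*)")


def tag_names(text):
    """Return (start, end, label) spans for names that follow an honorific."""
    return [(m.start(2), m.end(2), "NAME")
            for m in HONORIFIC_NAME.finditer(text)]


def redact_spans(text, spans):
    """Replace each tagged span with its bracketed label."""
    # Work from the right so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


note = ("Mr. James Bond has visited us at 12/12/2018 at 3:00PM for this "
        "routine doctor's appointment. Dr. Ethan Hunt has noted his left "
        "hand has some rash.")
print(redact_spans(note, tag_names(note)))
```

Keeping tagging and redaction as separate steps means the same span list could also come from a statistical NLP tagger, while the replacement logic stays unchanged. A heuristic like this misses names without an honorific, which is why a limited tagger alone is not enough.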

What we want you to deliver

Pre-requisite

Get Going

From within the code directory, run "pip install -r requirements.txt".
