This repository conteins the dataset and the script used to build the AI of Knoxly.
Knoxy is a Google Chrome plug-in that allows you to highlight sensitive data based on dictionaries and regexp.
In this repo I build an IA for Knoxly able to recognize context-dependent sensitive data.
Sensitive data that can identify AI relate to the following topics:
- politics
- health
- job
- travel
- general
- racism
- religion
- sexual orientation
To run this model you need:
Package | Version |
---|---|
pandas | 1.0.1 |
matpotlib | 3.1.3 |
numpy | 1.18.1 |
regex | 2020.2.20 |
sentence-splitter | 1.4 |
sklearn | 0.0 |
scipy | 1.4.1 |
tensorflow | 2.1.0 |
tensorflow-estimator | 2.1.0 |
tensorflow-hub | 0.7.0 |
tensorflow-text | 2.1.1 |
textblob | 0.15.3 |
To identify the following topics, a classifier has been trained to take data from the following datasets.
- Election day tweet
- Medical Transcript
- Medical speech transcript and intent
- Amazon job skills
- Twitter airline sentiment
- The movie dataset The pre-processed dataset used to build the dataset are in dataset folder
to create the dataset to give input to the IA:
# 1. in a new folder extraxct ds.zip
# 2. move the python script named crea.py present in src/create_ds in the new folder
echo text,tipo >> ds1000.csv
python3 crea.py ds1000.csv 1000
echo text,tipo >> ds2000.csv #for wild testing
python3 crea.py ds2000.csv 2000
# move ds1000 & ds2000 in src
python3 src/topic_class.py src/ds1000.csv
To run a sensitiveness classifier you need a new dataset with the label sensibile(1) not sensibile(0). The dataset is present in src and the name of the dataset is ds200.csv
python3 sensitiveness.py