xuqiongkai / PILD

PILD

Project and Dataset for Personal Information Leakage Detection (PILD) in Conversations.

Data Pre-processing

mkdir model output
python preprocess_rep.py -lm bert -input ./dataset/persona_linking_test.json -output ./dataset/persona_linking_test.bert
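python preprocess_rep.py -lm bert -input ./dataset/persona_linking_train.json -output ./dataset/persona_linking_train.bert

Note: the training commands below also read ./dataset/persona_linking_dev.bert, which the two commands above do not create; the dev split presumably needs the same conversion.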

Test Set (convert to TREC format)

python preprocess_trec.py -input ./dataset/persona_linking_test.json -output ./output/persona_linking_test.txt
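
For orientation, trec_eval reads relevance judgments as qrels lines with four whitespace-separated columns: query id, an iteration field it ignores, document id, and relevance. The following is a minimal sketch of that layout only. The function name, the nested-dict input, and the ids "u1", "p3", "p7" are hypothetical; the actual fields read and written by preprocess_trec.py may differ.

```python
def format_qrels(judgments):
    """Render judgments as TREC qrels lines: 'qid iteration docid relevance'."""
    lines = []
    for qid, docs in judgments.items():
        for docid, rel in docs.items():
            # The iteration column is ignored by trec_eval; 0 is conventional.
            lines.append(f"{qid} 0 {docid} {rel}")
    return lines


# Hypothetical example: utterance "u1" is linked to persona "p3" but not "p7".
example = {"u1": {"p3": 1, "p7": 0}}
qrels_lines = format_qrels(example)
```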

Model Training

python train.py -epochs 200 -save_model ./model/bert_ -method att_sparse -alpha 0.4 -train_dataset ./dataset/persona_linking_train.bert -dev_dataset ./dataset/persona_linking_dev.bert
python train.py -epochs 200 -save_model ./model/bert_ -method att_sharp -alpha 0.4 -gamma 6.0 -train_dataset ./dataset/persona_linking_train.bert -dev_dataset ./dataset/persona_linking_dev.bert
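
The testing commands below refer to checkpoints such as ./model/bert_att_sparse_0.01_0.4_E200*, which suggests that train.py appends the method, a 0.01 field, alpha, gamma (when given), and the epoch count to the -save_model prefix. The sketch below only reproduces that apparent pattern so the two steps can be matched up. The helper name is hypothetical, the naming rule is inferred from the commands rather than from train.py, and the meaning of the 0.01 field is not stated in this README.

```python
def checkpoint_glob(prefix, method, alpha, epochs, gamma=None, second_field="0.01"):
    """Build the glob that the testing commands appear to use for a trained model.

    second_field is an assumption: it is copied from the README's file names,
    and its meaning (possibly a default hyperparameter) is not documented here.
    """
    parts = [method, second_field, str(alpha)]
    if gamma is not None:
        parts.append(str(gamma))
    return f"{prefix}{'_'.join(parts)}_E{epochs}*"


sparse_glob = checkpoint_glob("./model/bert_", "att_sparse", 0.4, 200)
sharp_glob = checkpoint_glob("./model/bert_", "att_sharp", 0.4, 200, gamma=6.0)
```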

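Model Testing (Note: trec_eval is required. The commands below assume it is built under ~/Desktop/trec_eval-9.0.7; adjust the path to your installation.)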

python test.py -model ./model/bert_att_sparse_0.01_0.4_E200* -test_dataset ./dataset/persona_linking_test.bert -test_result ./output/bert_att_sparse_0.01_0.4_E200.result
~/Desktop/trec_eval-9.0.7/trec_eval output/persona_linking_test.txt output/bert_att_sparse_0.01_0.4_E200.result -m map -m P.1,2,3,5 -m Rprec -m ndcg
python test.py -model ./model/bert_att_sharp_0.01_0.4_6.0_E200* -test_dataset ./dataset/persona_linking_test.bert -test_result ./output/bert_att_sharp_0.01_0.4_6.0_E200.result
~/Desktop/trec_eval-9.0.7/trec_eval output/persona_linking_test.txt output/bert_att_sharp_0.01_0.4_6.0_E200.result -m map -m P.1,2,3,5 -m Rprec -m ndcg
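
To compare the two runs programmatically, the summary that trec_eval prints (one row per measure: measure name, the query id "all", and the value) can be parsed into a dictionary. This is a minimal sketch; the function name is hypothetical and the sample text uses placeholder numbers for illustration, not results from this project.

```python
def parse_trec_eval(output):
    """Parse trec_eval summary rows ('measure all value') into {measure: value}."""
    scores = {}
    for line in output.splitlines():
        parts = line.split()
        # Summary rows have three columns; the middle one is "all".
        if len(parts) == 3 and parts[1] == "all":
            try:
                scores[parts[0]] = float(parts[2])
            except ValueError:
                continue  # skip non-numeric rows such as the run id
    return scores


# Placeholder values only, shaped like trec_eval output.
sample = "map\tall\t0.5000\nRprec\tall\t0.2500\nndcg\tall\t0.7500\n"
scores = parse_trec_eval(sample)
```

In practice the input string would come from capturing the standard output of the trec_eval commands above, for example with subprocess.run(..., capture_output=True, text=True).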

Citation

@inproceedings{xu-etal-2020-personal,
    title = "Personal Information Leakage Detection in Conversations",
    author = "Xu, Qiongkai and Qu, Lizhen and Gao, Zeyu and Haffari, Gholamreza",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = Nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    pages = "6567--6580",
}

License: MIT License

