wilburOne / ACE_ERE_Scripts

Preprocessing scripts for ACE and ERE datasets

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

ACE and ERE Preprocessing

This repository includes the preprocessing scripts for ACE and ERE datasets, including name tagging, entity coreference, relation extraction, event extraction and event coreference tasks. (verified on ACE_2005 and Rich_ERE corpus)

Requirements

Python 3.6, jieba, NLTK

Usage

Step 1

Preprocess Data: remove XML tags from ACE/ERE articles, sentence merging.

python source2rsd.py --source [source_path] --rsd [rsd_path] --data [ace or ere] --extension [ending_of_source_files]

[source_path]: the path for input files (all .sgm files from ACE source corpus)

[rsd_path]: output path

Step 2

Sentence segmentation, tokenization with offset retrieval

python rsd2ltf.py --rsd [rsd_path] --ltf [ltf_path] --extension [ending_of_rsd_files]

[rsd_path]: the path for rsd files from step 1

[ltf_path]: output path

Step 3

Convert ltf files to sentences of tokens as the bio format in name tagging tasks

python ltf2bio.py --ltf [ltf_path] --bio [bio_path]

[ltf_path]: the path for input files

[bio_path]: output path

Step 4

Add annotations to bio files

python bio2ace.py --bio [bio_path] --ann [ann_path] --ace [ace_path]

[bio_path]: the path for input files

[ann_path]: the path for all annotation files from ACE

[ace_path]: output path

Similarly, for ERE corpus,

python bio2ere.py --bio [bio_path] --ann [ann_path] --ere [ace_path]

Citation

[1] Lifu Huang, Taylor Cassidy, Xiaocheng Feng, Heng Ji, Clare R Voss, Jiawei Han, Avirup Sil. Liberal Event Extraction and Event Schema Induction. Proc. ACL'2016

[2] Lifu Huang, Avirup Sil, Heng Ji, Radu Florian. Improving slot filling performance with attentive neural networks on dependency structures. Proc. EMNLP'2017

[3] Lifu Huang, Heng Ji, Kyunghyun Cho, Clare R Voss. Zero-shot transfer learning for event extraction, Proc. ACL, 2018

About

Preprocessing scripts for ACE and ERE datasets


Languages

Language:Python 100.0%