CLEVE: Contrastive Pre-training for Event Extraction

Source code for the ACL 2021 paper "CLEVE: Contrastive Pre-training for Event Extraction"

Requirements

transformers == 2.5.0
pytorch == 1.2.0
nltk
tqdm

Overview

Our pipeline contains four parts.

NYT preprocessing
AMR Parsing
Pre-training
Downstream Usage

NYT Preprocessing

Get dataset

Because of license restrictions, we cannot release the New York Times Annotated Corpus used in our pre-training, nor can we provide the preprocessed files here. Please download the dataset from here. We use ${NYT_HOME} to denote the path to the downloaded original NYT corpus.

Preprocess

First, we need to prepare a Python 2.7 environment. Then:

git clone https://github.com/notnews/nytimes-corpus-extractor.git
cd nytimes-corpus-extractor
pip install -r requirements.txt
python nytextract.py ${NYT_HOME}/data

This produces the full texts of the NYT corpus in .txt format in nytimes-corpus-extractor/text/nyt_corpus/data. We use ${NYT_TEXT_HOME} to denote this folder in later sections.

Merge

${NYTTEXTHOME} has plenty of folders and each folder has many .txt files, which is not convinient for later operations. Use

python ${CLEVE_HOME}/AMR/sent_tokenize.py --data_dir ${NYT_TEXT_HOME} --num ${NUM}

(This command needs Python 3.6)

${NUM} is the number of NYT sentences we actually use in our pre-training. 30000 is enough for our task. This command takes about 4 hours and produces a file nyt_sent_limit.txt, which contains one sentence per line. We use [input_sentence_file] to denote this file.
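
As a rough illustration of what this merge step does, the sketch below walks a folder of .txt files, splits each file into sentences, and writes at most num sentences, one per line. It is not the repository's sent_tokenize.py: the names merge_sentences and naive_split, the regex-based splitter, and the explicit out_file argument are all assumptions made for illustration. A proper splitter such as nltk.sent_tokenize can be passed in as splitter instead.

```python
import os
import re
from typing import Callable, List


def naive_split(text: str) -> List[str]:
    """Very rough splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def merge_sentences(data_dir: str, out_file: str, num: int,
                    splitter: Callable[[str], List[str]] = naive_split) -> int:
    """Write up to `num` sentences from all .txt files under data_dir to out_file.

    One sentence per line. Returns the number of sentences written.
    out_file should live outside data_dir so it is not picked up by the walk.
    """
    written = 0
    with open(out_file, "w", encoding="utf-8") as out:
        for root, dirs, files in os.walk(data_dir):
            dirs.sort()  # deterministic traversal order
            for name in sorted(files):
                if not name.endswith(".txt"):
                    continue
                with open(os.path.join(root, name), encoding="utf-8") as f:
                    text = f.read()
                for sent in splitter(text):
                    if written >= num:
                        return written
                    # Collapse internal whitespace so each sentence stays on one line.
                    out.write(" ".join(sent.split()) + "\n")
                    written += 1
    return written
```

Stopping as soon as the limit is reached avoids tokenizing the rest of the corpus when only a small sample is needed.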

AMR Parsing

In this section, we use CAMR to parse the file [input_sentence_file] and JAMR to do alignment. Our goal is to get an AMR file in the format of the following example:

# ::id 1
# ::snt It's as if Carl Lewis were an actor instead of an athlete.
# ::tok It 's as if Carl Lewis were an actor instead of an athlete .
# ::alignments 4-6|0.1.0+0.1.0.0+0.1.0.0.0+0.1.0.0.1 0-1|0.0 8-9|0.1 12-13|0 ::annotator Aligner v.03 ::date 2021-08-29T03:10:23.763
# ::node	0	athletes	12-13
# ::node	0.0	it	0-1
# ::node	0.1	actor	8-9
# ::node	0.1.0	newspaper	4-6
# ::node	0.1.0.0	name	4-6
# ::node	0.1.0.0.0	"Carl"	4-6
# ::node	0.1.0.0.1	"Lewis"	4-6
# ::root	0	athletes
# ::edge	actor	ARG0	newspaper	0.1	0.1.0	
# ::edge	athletes	domain	actor	0	0.1	
# ::edge	athletes	domain	it	0	0.0	
# ::edge	name	op1	"Carl"	0.1.0.0	0.1.0.0.0	
# ::edge	name	op2	"Lewis"	0.1.0.0	0.1.0.0.1	
# ::edge	newspaper	name	name	0.1.0	0.1.0.0	
(x13 / athletes
	:domain (x1 / it)
	:domain (x9 / actor
		:ARG0 (x5 / newspaper
			:name (n / name
				:op1 "Carl"
				:op2 "Lewis"))))

(Other instances....)

If you want to use another AMR parser to get this file, you can skip this section but keep the final file in the same format. We denote this file as [nyt_parsed_file].
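
To check that a file produced by another parser matches this format, it may help to load it into plain Python structures. The sketch below is one possible reader, not the loader used in load_AMR.py. It assumes the layout shown in the example: space-separated "# ::key value" metadata lines, tab-separated "# ::node", "# ::root" and "# ::edge" lines, and a blank line between instances. The function name read_amr_instances and the dictionary keys are hypothetical.

```python
from typing import Dict, Iterable, List


def read_amr_instances(lines: Iterable[str]) -> List[Dict]:
    """Group an aligned AMR file into one dict per instance (blank-line separated)."""
    instances, cur = [], None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current instance, if any.
            if cur is not None:
                instances.append(cur)
                cur = None
            continue
        if cur is None:
            cur = {"meta": {}, "nodes": [], "edges": [], "root": None, "graph": []}
        fields = line.rstrip("\t").split("\t")
        if line.startswith("# ::node"):
            # fields: marker, node id, concept, optional token span "start-end"
            span = None
            if len(fields) > 3 and fields[3]:
                start, end = fields[3].split("-")
                span = (int(start), int(end))
            cur["nodes"].append({"id": fields[1], "concept": fields[2], "span": span})
        elif line.startswith("# ::root"):
            cur["root"] = {"id": fields[1], "concept": fields[2]}
        elif line.startswith("# ::edge"):
            # fields: marker, head concept, relation, tail concept, head id, tail id
            cur["edges"].append({"head": fields[1], "relation": fields[2],
                                 "tail": fields[3], "head_id": fields[4],
                                 "tail_id": fields[5]})
        elif line.startswith("# ::"):
            # Metadata such as id, snt, tok, alignments: key, then the rest of the line.
            key, _, value = line[len("# ::"):].partition(" ")
            cur["meta"][key] = value
        else:
            # Anything else is part of the PENMAN-style graph text.
            cur["graph"].append(line)
    if cur is not None:
        instances.append(cur)
    return instances
```

A file can be read with open(path, encoding="utf-8") and passed directly as lines. Each node's span indexes into the tokens of meta["tok"].split(), which makes it easy to spot instances whose alignments are missing.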

CAMR

We still need to use Python 2.7 to run CAMR.

git clone https://github.com/c-amr/camr.git
pip install nltk==3.4.5
cd camr
bash ./scripts/config.sh

The nltk version should be no higher than 3.4.5, since 3.4.5 is the last version that supports Python 2.7. Then add ssplit.eolonly=true to ${CAMR_HOME}/stanfordnlp/default.properties (otherwise a bug will occur) and set VERBOSE to False in ${CAMR_HOME}/stanfordnlp/default.properties (otherwise parsing will be much slower).
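
If you set up CAMR repeatedly, the ssplit.eolonly edit can be scripted. The helper below is a hypothetical convenience, not part of CAMR or this repository: ensure_property appends a key=value line to a properties file only when no non-comment line already sets that key. It assumes simple key=value lines and needs Python 3.6 or later, so run it outside the Python 2.7 environment. It does not touch the VERBOSE setting.

```python
def ensure_property(path: str, key: str, value: str) -> bool:
    """Append 'key=value' to a .properties file unless the key is already set.

    Returns True if the file was modified. An existing entry for the key is
    left unchanged, even if its value differs.
    """
    with open(path, encoding="utf-8") as f:
        content = f.read()
    for line in content.splitlines():
        stripped = line.strip()
        if stripped.startswith(("#", "!")):  # comment lines in .properties files
            continue
        if stripped.split("=", 1)[0].strip() == key:
            return False
    with open(path, "a", encoding="utf-8") as f:
        if content and not content.endswith("\n"):
            f.write("\n")  # keep the new entry on its own line
        f.write(f"{key}={value}\n")
    return True
```

For example, ensure_property(camr_home + "/stanfordnlp/default.properties", "ssplit.eolonly", "true"), where camr_home is the path of your CAMR checkout. Calling it a second time is a no-op.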

CAMR requires JDK 1.8. You can download JDK 1.8 from Oracle and add it to your $PATH environment variable. Then preprocess the sentence file:

python amr_parsing.py -m preprocess [input_sentence_file]

For 30000 sentences, this script runs for about 8 hours. It produces tokenized sentences (.tok), POS tags and named entities (.prp), and dependency structures (.charniak.parse.dep). Then download the model file and uncompress it:

wget http://www.cs.brandeis.edu/~cwang24/files/amr-anno-1.0.train.m.tar.gz
tar zxvf amr-anno-1.0.train.m.tar.gz

Now we can do parsing:

python amr_parsing.py -m parse --model [model_file] [input_sentence_file] 2>log/error.log

Now we get the parsed AMR file (.parsed), denoted as [input_amr_file]. Before we do alignment, we need to add tokens to the AMR file:

python amr_parsing.py -m preprocess --amrfmt amr [input_amr_file]

Now we get a tokenized AMR file (.amr.tok), denoted as [input_amr_tok_file]. It should look like:

# ::id 1
# ::snt It's as if Carl Lewis were an actor instead of an athlete.
# ::tok It 's as if Carl Lewis were an actor instead of an athlete .
(x13 / athletes
	:domain (x1 / it)
	:domain (x9 / actor
		:ARG0 (x5 / newspaper
			:name (n / name
				:op1 "Carl"
				:op2 "Lewis"))))

(Other instances....)

JAMR

We still need Python 2.7 to run JAMR. To set up JAMR:

git clone https://github.com/jflanigan/jamr.git
cd jamr
git checkout Semeval-2016

JAMR requires sbt == 0.13.18. If you do not have it, you need to install it via:

wget https://github.com/sbt/sbt/releases/download/v0.13.18/sbt-0.13.18.tgz
tar zxvf sbt-0.13.18.tgz

Then add it to your $PATH and run sbt about to check that it is available. Next, run the following commands to set up JAMR:

bash ./setup
bash scripts/config.sh
./compile

Use this command to do alignment:

${JAMR_HOME}/run Aligner -v 0 --print-nodes-and-edges < [input_amr_tok_file] > [nyt_parsed_file]

Pre-training

Dataset

If you are running with ACE 2005, please preprocess it into the same format as this repo. If you are running with MAVEN, nothing needs to be done. Processed data should be stored in ${DATA_HOME} (${ACE_HOME} or ${MAVEN_HOME}).

Pre-training

Now switch to Python 3.6. To get contrastive pre-training data, use:

python ${CLEVE_HOME}/AMR/load_AMR.py --amr_file [nyt_parsed_file]

You will get a file contrast_examples.pkl that contains the pre-training data. Put it into ${ACE_HOME}. Then use the following command to pre-train the model:

CUDA_VISIBLE_DEVICES=${GPU_ID} python run_ee.py \
    --data_dir ${ACE_HOME} \
    --model_type roberta \
    --model_name_or_path roberta-large \
    --task_name ace \
    --output_dir ${MODEL_DUMP_HOME} \
    --max_seq_length 128 \
    --do_lower_case \
    --per_gpu_train_batch_size ${BATCH_SIZE} \
    --per_gpu_eval_batch_size ${BATCH_SIZE} \
    --gradient_accumulation_steps 1 \
    --learning_rate 1e-5 \
    --num_train_epochs 100 \
    --save_steps 50 \
    --logging_steps 50 \
    --seed 233333 \
    --do_train \
    --do_eval \
    --do_test \
    --evaluate_during_training \
    --max_contrast_entity_per_sentence 10 \
    --do_pretrain

You will get the pre-trained model in ${MODEL_DUMP_HOME}. Please adjust ${BATCH_SIZE} according to your GPU memory.
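
The downstream commands load a specific checkpoint from ${MODEL_DUMP_HOME}, so it can be useful to list what was saved. The sketch below assumes that run_ee.py writes each checkpoint to a subdirectory named checkpoint-<step>, as the checkpoint-XX paths in the downstream commands suggest. The helper name list_checkpoints is hypothetical.

```python
import os
import re
from typing import List


def list_checkpoints(model_dir: str) -> List[str]:
    """Return the 'checkpoint-<step>' subdirectories of model_dir, sorted by step."""
    found = []
    for name in os.listdir(model_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        path = os.path.join(model_dir, name)
        if match and os.path.isdir(path):
            found.append((int(match.group(1)), path))
    # Sort numerically so that checkpoint-100 comes after checkpoint-50.
    return [path for _, path in sorted(found)]
```

The last element is the most recent checkpoint, not necessarily the best one. Which checkpoint to use downstream is a separate choice that this helper does not make.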

Downstream Usage

Supervised Event Extraction

To run event detection:

CUDA_VISIBLE_DEVICES=${GPU_ID} python run_ee.py \
    --data_dir ${DATA_HOME} \
    --model_type roberta \
    --model_name_or_path ${MODEL_DUMP_HOME}/checkpoint-XX \
    --task_name ${TASK_NAME} \
    --output_dir ${ED_MODEL_DUMP_HOME} \
    --max_seq_length 128 \
    --do_lower_case \
    --per_gpu_train_batch_size ${BATCH_SIZE} \
    --per_gpu_eval_batch_size ${BATCH_SIZE} \
    --gradient_accumulation_steps 1 \
    --learning_rate 1e-5 \
    --num_train_epochs 50 \
    --save_steps 500 \
    --logging_steps 50 \
    --seed 233333 \
    --do_train \
    --do_eval \
    --do_test \
    --evaluate_during_training

${TASK_NAME} can be ace or maven. Please adjust ${BATCH_SIZE} according to your GPU memory.

After event detection, you will get a pred.json file in ${ED_MODEL_DUMP_HOME}. To run event argument extraction, copy this file into ${DATA_HOME} and run:

cd EAE
CUDA_VISIBLE_DEVICES=${GPU_ID} python run_ee.py \
    --data_dir ${ACE_HOME} \
    --model_type roberta \
    --model_name_or_path ${MODEL_DUMP_HOME}/checkpoint-XX \
    --task_name ace_eae \
    --output_dir ${EAE_MODEL_DUMP_HOME} \
    --max_seq_length 128 \
    --do_lower_case \
    --per_gpu_train_batch_size ${BATCH_SIZE} \
    --per_gpu_eval_batch_size ${BATCH_SIZE} \
    --gradient_accumulation_steps 2 \
    --learning_rate 1e-5 \
    --num_train_epochs 50 \
    --save_steps 100 \
    --logging_steps 100 \
    --seed 11 \
    --do_train \
    --do_eval \
    --do_test \
    --evaluate_during_training

The parameters are similar to those of the event detection part.

Citation

If this code helps you, please cite our paper:

@inproceedings{wang-etal-2021-cleve,
    title = "{CLEVE}: {C}ontrastive {P}re-training for {E}vent {E}xtraction",
    author = "Wang, Ziqi  and Wang, Xiaozhi  and Han, Xu  and Lin, Yankai  and Hou, Lei  and Liu, Zhiyuan  and Li, Peng  and Li, Juanzi  and Zhou, Jie",
    booktitle = "Proceedings of ACL-IJCNLP",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.491",
    doi = "10.18653/v1/2021.acl-long.491",
    pages = "6283--6297",
}
