ivan-bilan / Relation-Extraction-Transformer

NLP: Relation extraction with position-aware self-attention transformer


Position-Aware Self-Attention for Relation Extraction

WORK IN PROGRESS! Ideas, bug-fixes and constructive criticism are all welcome.

This project is the result of my Master's Thesis (supervised by Dr. Benjamin Roth):

"Relation extraction using deep neural networks and self-attention"
The Center for Information and Language Processing (CIS)
Ludwig Maximilian University of Munich
Ivan Bilan

The pre-print is available on arXiv (in collaboration with Dr. Benjamin Roth):

https://arxiv.org/abs/1807.03052

Related presentation from PyData Berlin 2018:

Understanding and Applying Self-Attention for NLP - Ivan Bilan

Requirements

  • Python 3.5+
  • PyTorch 1.0
  • CUDA 10.0 (or 9.0+)
  • CuDNN 7.4 (or 7.1+)

How to set up

1. Python Environment

To automatically create a conda environment (using Anaconda3) with Python 3.7 and PyTorch 1.0dev, run the following command:

make build_venv

Note: CUDA must already be installed before you create the environment.
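
After the environment is built, you can quickly confirm that PyTorch actually sees your CUDA and CuDNN installation (a small sanity check, assuming PyTorch is installed in the active environment):

import torch

# All three should report the GPU stack PyTorch was built against.
print(torch.cuda.is_available())        # expect True
print(torch.version.cuda)               # e.g. 10.0
print(torch.backends.cudnn.version())   # e.g. 7401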

2. Dataset

The TACRED dataset used for evaluation is currently not publicly available. Follow the original authors' GitHub page for more updates: https://github.com/yuhaozhang/tacred-relation

A sample of the dataset is available at: https://github.com/yuhaozhang/tacred-relation/tree/master/dataset/tacred

For this implementation, we use the JSON format of the dataset, which can be generated with the JSON generation script included with the dataset.
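
A quick way to get a feel for the JSON format is to load one record and print its relation and entity spans. The field names below follow the sample dataset linked above and should be treated as assumptions if your copy differs:

import json

# Illustrative path; point it at one of your generated JSON splits.
with open("dataset/tacred/train.json") as f:
    data = json.load(f)

ex = data[0]
tokens = ex["token"]
subj = " ".join(tokens[ex["subj_start"]:ex["subj_end"] + 1])
obj = " ".join(tokens[ex["obj_start"]:ex["obj_end"] + 1])
print(ex["relation"], "|", subj, "(" + ex["subj_type"] + ")", "|", obj, "(" + ex["obj_type"] + ")")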

3. Vocabulary preparation

First, download and unzip GloVe vectors from the Stanford website, with:

chmod +x download.sh; ./download.sh
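
GloVe files are plain text with one token per line followed by its vector components, which is what prepare_vocab.py parses for you. If you want to peek at a file yourself (the exact file name depends on which GloVe archive download.sh fetched):

# Read the first entry of a downloaded GloVe file (path is illustrative).
with open("dataset/glove/glove.840B.300d.txt", encoding="utf-8") as f:
    parts = f.readline().rstrip().split(" ")

word, vector = parts[0], [float(x) for x in parts[1:]]
print(word, len(vector))  # for a 300d file this prints the token and 300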

Then prepare vocabulary and initial word vectors with:

python prepare_vocab.py dataset/tacred dataset/vocab --glove_dir dataset/glove

This writes the vocabulary and the initial word vectors (as a numpy matrix) into the directory dataset/vocab.
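
To sanity-check the generated artifacts you can load them back. The file names below (vocab.pkl and embedding.npy) are an assumption based on the original TACRED implementation, so adjust them if your output differs:

import pickle
import numpy as np

with open("dataset/vocab/vocab.pkl", "rb") as f:
    vocab = pickle.load(f)
embedding = np.load("dataset/vocab/embedding.npy")

# The embedding matrix should have one row per vocabulary entry.
print(len(vocab), embedding.shape)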

Project Usage

1. Training

Train our final model with:

python runner.py --data_dir dataset/tacred --vocab_dir dataset/vocab --id 00 --info "Position-aware attention model with self-attention encoder"

Use --topn N to fine-tune the top N word vectors only. The script will do the preprocessing automatically (word dropout, entity masking, etc.).
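
As a rough illustration of that preprocessing (a sketch, not the project's code): entity masking replaces the subject and object tokens with placeholders derived from their entity types, and word dropout replaces other tokens with UNK at the rate set by --word_dropout:

import random

def mask_and_dropout(tokens, subj_span, obj_span, subj_type, obj_type,
                     word_dropout=0.06, unk="<UNK>"):
    # Sketch of entity masking + word dropout; the placeholder conventions are assumptions.
    out = []
    for i, tok in enumerate(tokens):
        if subj_span[0] <= i <= subj_span[1]:
            out.append("SUBJ-" + subj_type)   # mask the subject with its type
        elif obj_span[0] <= i <= obj_span[1]:
            out.append("OBJ-" + obj_type)     # mask the object with its type
        elif random.random() < word_dropout:
            out.append(unk)                   # word dropout
        else:
            out.append(tok)
    return out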

To train a model with the self-attention encoder only, use:

python runner.py --data_dir dataset/tacred --vocab_dir dataset/vocab --no-attn --id 01 --info "self-attention model"

To combine the self-attention encoder, the LSTM and the position-aware layer, use:

python runner.py --data_dir dataset/tacred --vocab_dir dataset/vocab --self_att_and_rnn --id 01 --info "combined model"

To train the LSTM-only baseline model, use:

python runner.py --data_dir dataset/tacred --vocab_dir dataset/vocab --no_self_att --no-attn --id 01 --info "baseline model"

To use absolute positional encodings in self-attention instead of relative ones, use:

python runner.py --data_dir dataset/tacred --vocab_dir dataset/vocab --no_diagonal_positional_attention --id 01 --info "no relative pos encodings"

Model checkpoints and logs will be saved to ./saved_models/00.

2. Evaluation

Run evaluation on the test set with:

python eval.py --model_dir saved_models/00

This uses best_model.pt by default. Use --model checkpoint_epoch_10.pt to specify a particular model checkpoint, and add --out saved_models/out/test1.pkl to write the model's probability output to a file (for ensembling, etc.). In our evaluation runs we always evaluate the last-epoch checkpoint, namely --model checkpoint_epoch_60.pt:

python eval.py --model_dir saved_models/00 --model checkpoint_epoch_60.pt

3. Ensemble Training

To run the ensemble model, use:

bash ensemble.sh
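
ensemble.sh drives the ensemble run; its exact combination strategy is defined in the script, but a common scheme, sketched below under assumed file names and pickle layout, is to average the per-class probabilities written with --out and take the argmax:

import pickle
import numpy as np

# Hypothetical probability files produced by eval.py --out for several models.
paths = ["saved_models/out/test1.pkl", "saved_models/out/test2.pkl"]

probs = []
for path in paths:
    with open(path, "rb") as f:
        probs.append(np.asarray(pickle.load(f)))  # assumed shape: (n_examples, n_classes)

avg = np.mean(probs, axis=0)       # average the class probabilities
predictions = avg.argmax(axis=1)   # ensemble prediction per example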

Best results

Comparison of results on the evaluation set (single model):

Evaluation Metric    Our approach    Zhang et al. 2017
Precision (micro)    65.4%           65.7%
Recall (micro)       68.0%           64.5%
F1 (micro)           66.7%           65.1%

Per-relation statistics (single model):

org:alternate_names                  P:  74.78%  R:  80.75%  F1:  77.65%  #: 213
org:city_of_headquarters             P:  71.59%  R:  76.83%  F1:  74.12%  #: 82
org:country_of_headquarters          P:  55.70%  R:  40.74%  F1:  47.06%  #: 108
org:dissolved                        P: 100.00%  R:   0.00%  F1:   0.00%  #: 2
org:founded                          P:  84.21%  R:  86.49%  F1:  85.33%  #: 37
org:founded_by                       P:  72.22%  R:  38.24%  F1:  50.00%  #: 68
org:member_of                        P: 100.00%  R:   0.00%  F1:   0.00%  #: 18
org:members                          P:   0.00%  R:   0.00%  F1:   0.00%  #: 31
org:number_of_employees/members      P:  65.22%  R:  78.95%  F1:  71.43%  #: 19
org:parents                          P:  40.00%  R:  19.35%  F1:  26.09%  #: 62
org:political/religious_affiliation  P:  25.81%  R:  80.00%  F1:  39.02%  #: 10
org:shareholders                     P:  75.00%  R:  23.08%  F1:  35.29%  #: 13
org:stateorprovince_of_headquarters  P:  64.18%  R:  84.31%  F1:  72.88%  #: 51
org:subsidiaries                     P:  55.17%  R:  36.36%  F1:  43.84%  #: 44
org:top_members/employees            P:  66.44%  R:  84.68%  F1:  74.46%  #: 346
org:website                          P:  53.33%  R:  92.31%  F1:  67.61%  #: 26
per:age                              P:  78.06%  R:  92.50%  F1:  84.67%  #: 200
per:alternate_names                  P:   0.00%  R:   0.00%  F1:   0.00%  #: 11
per:cause_of_death                   P:  63.64%  R:  40.38%  F1:  49.41%  #: 52
per:charges                          P:  66.91%  R:  90.29%  F1:  76.86%  #: 103
per:children                         P:  38.30%  R:  48.65%  F1:  42.86%  #: 37
per:cities_of_residence              P:  52.91%  R:  62.43%  F1:  57.28%  #: 189
per:city_of_birth                    P:  50.00%  R:  20.00%  F1:  28.57%  #: 5
per:city_of_death                    P: 100.00%  R:  21.43%  F1:  35.29%  #: 28
per:countries_of_residence           P:  50.00%  R:  55.41%  F1:  52.56%  #: 148
per:country_of_birth                 P: 100.00%  R:   0.00%  F1:   0.00%  #: 5
per:country_of_death                 P: 100.00%  R:   0.00%  F1:   0.00%  #: 9
per:date_of_birth                    P:  77.78%  R:  77.78%  F1:  77.78%  #: 9
per:date_of_death                    P:  62.16%  R:  42.59%  F1:  50.55%  #: 54
per:employee_of                      P:  64.34%  R:  69.70%  F1:  66.91%  #: 264
per:origin                           P:  68.81%  R:  56.82%  F1:  62.24%  #: 132
per:other_family                     P:  59.09%  R:  43.33%  F1:  50.00%  #: 60
per:parents                          P:  58.82%  R:  56.82%  F1:  57.80%  #: 88
per:religion                         P:  44.16%  R:  72.34%  F1:  54.84%  #: 47
per:schools_attended                 P:  64.29%  R:  60.00%  F1:  62.07%  #: 30
per:siblings                         P:  61.29%  R:  69.09%  F1:  64.96%  #: 55
per:spouse                           P:  56.58%  R:  65.15%  F1:  60.56%  #: 66
per:stateorprovince_of_birth         P:  40.00%  R:  50.00%  F1:  44.44%  #: 8
per:stateorprovince_of_death         P:  80.00%  R:  28.57%  F1:  42.11%  #: 14
per:stateorprovinces_of_residence    P:  65.28%  R:  58.02%  F1:  61.44%  #: 81
per:title                            P:  77.13%  R:  87.00%  F1:  81.77%  #: 500
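
The micro-averaged scores above follow the usual TACRED scoring convention: no_relation never counts as a positive class, so precision is computed over predicted relations and recall over gold relations. A minimal sketch of that computation (not the official scorer):

def micro_prf(gold, pred, no_rel="no_relation"):
    # Micro precision / recall / F1 with no_relation excluded from the positives.
    correct = sum(1 for g, p in zip(gold, pred) if p == g and p != no_rel)
    pred_pos = sum(1 for p in pred if p != no_rel)
    gold_pos = sum(1 for g in gold if g != no_rel)
    precision = correct / pred_pos if pred_pos else 0.0
    recall = correct / gold_pos if gold_pos else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1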

WARNING: Some users have not been able to reproduce these results with newer PyTorch versions. At the time of the pre-print we used PyTorch 0.4.1 to obtain them. The project may require significant changes or updates to reach the previously reported results with newer PyTorch versions. If you happen to find the cause of the performance degradation, feel free to contribute a fix.

Overview of Available Hyperparameters

General Hyperparameters

  • --emb_dim (default: 300): Word embedding dimension size
  • --word_dropout (default: 0.06): Rate at which a word is randomly set to UNK
  • --lower / --no-lower (default: True): Lowercase all words
  • --weight_no_rel (default: 1.0): Weight for the no_relation class
  • --weight_rest (default: 1.0): Weight for all classes other than no_relation
  • --lr (default: 0.1): Learning rate (applies to SGD and Adagrad only)
  • --lr_decay (default: 0.9): Learning rate decay
  • --decay_epoch (default: 15): Start learning rate decay from the given epoch
  • --max_grad_norm (default: 1.0): Gradient clipping value
  • --optim (default: sgd): Optimizer; available options: sgd, asgd, adagrad, adam, nadam, noopt_adam, openai_adam, adamax
  • --num_epoch (default: 70): Number of epochs
  • --batch_size (default: 50): Batch size
  • --topn (default: 1e10): Only fine-tune the top N embeddings
  • --log_step (default: 400): Print a log entry every k steps
  • --log (default: logs.txt): Write the training log to the specified file
  • --save_epoch (default: 1): Save a model checkpoint every k epochs
  • --save_dir (default: ./saved_models): Root directory for saving models

Position-aware Attention Layer

  • --ner_dim (default: 30): NER embedding dimension
  • --pos_dim (default: 30): POS embedding dimension
  • --pe_dim (default: 30): Position encoding dimension in the attention layer
  • --attn_dim (default: 200): Attention size in the attention layer
  • --query_size_attn (default: 360): Query embedding size in the positional attention
  • --attn / --no-attn (default: True): Use the position-aware attention layer

Position-aware Attention LSTM Layer

  • --hidden_dim (default: 360): LSTM hidden state size
  • --num_layers (default: 2): Number of LSTM layers
  • --lstm_dropout (default: 0.5): LSTM dropout rate
  • --self_att_and_rnn / --no_self_att_and_rnn (default: False): Use the LSTM layer together with the self-attention layer

Self-attention

  • --num_layers_encoder (default: 1): Number of self-attention encoders
  • --n_head (default: 3): Number of self-attention heads
  • --dropout (default: 0.4): Input and attention dropout rate
  • --hidden_self (default: 130): Encoder layer width
  • --scaled_dropout (default: 0.1): ScaledDotProduct attention dropout
  • --temper_value (default: 0.5): Temper value for ScaledDotProduct attention
  • --use_batch_norm (default: True): Use BatchNorm in self-attention
  • --use_layer_norm (default: False): Use LayerNorm in self-attention
  • --new_residual (default: True): Use a different residual connection structure than in the original self-attention
  • --old_residual (default: False): Use the original residual connections in self-attention
  • --obj_sub_pos (default: True): Add object and subject positional vectors in self-attention
  • --relative_positions / --no_relative_positions (default: True): Bin the relative positional encodings
  • --diagonal_positional_attention / --no_diagonal_positional_attention (default: True): Use relative positional encodings as described in our paper
  • --self-attn / --no_self_att (default: True): Use the self-attention encoder

Lemmatize input

  • --use_lemmas / --no_lemmas (default: False): Use spaCy to lemmatize the sentences instead of using raw text
  • --preload_lemmas / --no_preload_lemmas (default: False): Preload lemmatized input as pickles
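
As a concrete illustration of the self-attention hyperparameters (--n_head, --scaled_dropout, --temper_value), here is a minimal scaled dot-product attention sketch with a configurable temperature exponent and attention dropout. How temper_value enters the scaling here is an assumption made for illustration, not a copy of the project's module:

import torch
import torch.nn as nn

class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k, temper_value=0.5, scaled_dropout=0.1):
        super().__init__()
        # temper_value = 0.5 recovers the usual sqrt(d_k) scaling (assumed mapping).
        self.temperature = d_k ** temper_value
        self.dropout = nn.Dropout(scaled_dropout)

    def forward(self, q, k, v, mask=None):
        # q, k, v: (batch, heads, seq_len, d_k)
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.temperature
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = self.dropout(torch.softmax(scores, dim=-1))
        return torch.matmul(attn, v), attn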

Attention Example

Sample Sentence from TACRED:

They cited the case of Agency for International Development (OBJECT) subcontractor Alan Gross (SUBJECT), who was working in Cuba on a tourist visa and possessed satellite communications equipment, who has been held in a maximum security prison since his arrest Dec 3.

The attention distribution for the preposition "of" in the sentence above is shown in the Attention Distribution figure in the repository.
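
If you want to produce a similar plot from your own attention weights, a minimal matplotlib sketch looks like this; the tokens and the attention matrix are stand-ins for values you would extract from the model:

import numpy as np
import matplotlib.pyplot as plt

tokens = ["They", "cited", "the", "case", "of", "Agency"]   # truncated example sentence
attn = np.random.rand(len(tokens), len(tokens))             # stand-in for real attention weights

query = tokens.index("of")                # the preposition highlighted above
plt.bar(range(len(tokens)), attn[query])  # attention from "of" to every token
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.ylabel("attention weight")
plt.tight_layout()
plt.show()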

Acknowledgement

The self-attention implementation in this project is largely based on Attention is all you need: A Pytorch Implementation (related code licensed under the MIT License); all modifications are explained in the paper linked above.

The original TACRED implementation, Position-aware Attention RNN Model for Relation Extraction (related code licensed under the Apache License, Version 2.0), serves as the base of this implementation; all modifications are explained in the paper linked above.

License

All original code in this project is licensed under the Apache License, Version 2.0. See the included LICENSE file.

TODOs

  • Improve and document attention visualization process
  • Add weighting functions as a hyperparameter
  • Add tests
  • The project is currently hard-coded to run on a GPU; add CPU support
  • Do more experiments with the Adam optimizer (e.g. lr=0.0001)
