menghuu / graph-2-text

Graph to sequence implemented in Pytorch combining Graph convolutional networks and opennmt-py

This is the code used in the paper Deep Graph Convolutional Encoders for Structured Data to Text Generation by Diego Marcheggiani and Laura Perez-Beltrachini.

We extended the OpenNMT library with a Graph Convolutional Network encoder.

Dependencies

Download and prepare data

Download the WebNLG data from here into data/webnlg/, keeping the three folders for the different partitions.

A preparation step extracts the nodes and the edges from the graphs. Instructions for this are in point 1 of the WebNLG scripts section.

The preprocessing, training, and generation steps are the same for the Surface Realization Task (SR11) data.
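
As a quick check before preprocessing, the sketch below lists the files that the preprocessing command in the next section expects for one partition and reports any that are missing. The file-name pattern is copied from that command; the helper names `expected_files` and `missing_files` are illustrative and not part of the repository.

```python
import os

# Suffixes taken from the file names passed to preprocess.py.
SUFFIXES = ["src-nodes", "src-labels", "src-node1", "src-node2", "tgt"]


def expected_files(data_dir, partition, variant="notdelex"):
    """Return the five paths used for one partition (e.g. 'train' or 'dev')."""
    return [
        os.path.join(data_dir, f"{partition}-webnlg-all-{variant}-{suffix}.txt")
        for suffix in SUFFIXES
    ]


def missing_files(data_dir, partition, variant="notdelex"):
    """Return the expected paths that do not exist on disk."""
    return [
        path
        for path in expected_files(data_dir, partition, variant)
        if not os.path.isfile(path)
    ]


if __name__ == "__main__":
    for part in ("train", "dev"):
        for path in missing_files("data/webnlg", part):
            print("missing:", path)
```

An empty output suggests the preparation step produced every file the preprocessing command refers to.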

Preprocess

Using the files obtained in the preparation step, we first generate the data and dictionary files for OpenNMT.

To preprocess the raw files run:

python3 preprocess.py -train_src data/webnlg/train-webnlg-all-notdelex-src-nodes.txt \
-train_label data/webnlg/train-webnlg-all-notdelex-src-labels.txt \
-train_node1 data/webnlg/train-webnlg-all-notdelex-src-node1.txt \
-train_node2 data/webnlg/train-webnlg-all-notdelex-src-node2.txt \
-train_tgt data/webnlg/train-webnlg-all-notdelex-tgt.txt \
-valid_src data/webnlg/dev-webnlg-all-notdelex-src-nodes.txt \
-valid_label data/webnlg/dev-webnlg-all-notdelex-src-labels.txt \
-valid_node1 data/webnlg/dev-webnlg-all-notdelex-src-node1.txt \
-valid_node2 data/webnlg/dev-webnlg-all-notdelex-src-node2.txt \
-valid_tgt data/webnlg/dev-webnlg-all-notdelex-tgt.txt \
-save_data data/gcn_exp -src_vocab_size 5000 -tgt_vocab_size 5000 -data_type gcn 

The argument -dynamic_dict is needed to train models that use the copy mechanism, e.g., the model GCN_CE in the paper.

The preprocessing steps for the SR11 task are the same as for WebNLG.
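
The five input files for a partition are read in parallel. Assuming one example per line in each file (the usual convention for OpenNMT-style parallel data, which is not stated explicitly here), differing line counts usually point to a problem in the preparation step. A minimal sketch; `count_lines` and `check_aligned` are hypothetical helper names.

```python
def count_lines(path):
    """Count the lines in a text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_aligned(paths):
    """Return {path: line_count}; raise ValueError if the counts differ."""
    counts = {path: count_lines(path) for path in paths}
    if len(set(counts.values())) > 1:
        raise ValueError(f"line counts differ: {counts}")
    return counts
```

It could be called with the five training paths from the command above, for example `check_aligned([...src-nodes.txt, ...src-labels.txt, ...src-node1.txt, ...src-node2.txt, ...tgt.txt])`.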

Embeddings

To use pre-trained embeddings in OpenNMT, run this preprocessing step first:

export glove_dir="../vectors"
python3 tools/embeddings_to_torch.py \
    -emb_file "$glove_dir/glove.6B.200d.txt" \
    -dict_file "data/gcn_exp.vocab.pt" \
    -output_file "data/gcn_exp.embeddings" 
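
GloVe text files store one token per line followed by its vector components, separated by spaces. Before converting, it can help to confirm the dimensionality of the file being used. The sketch below reads only the first few lines; `glove_dimension` is a hypothetical helper, not part of tools/embeddings_to_torch.py.

```python
def glove_dimension(path, max_lines=5):
    """Return the vector size found in the first lines of a GloVe text file."""
    dims = set()
    with open(path, encoding="utf-8") as handle:
        for i, line in enumerate(handle):
            if i >= max_lines:
                break
            parts = line.rstrip().split(" ")
            # The first field is the token; the rest must parse as floats.
            vector = [float(x) for x in parts[1:]]
            dims.add(len(vector))
    if len(dims) != 1:
        raise ValueError(f"inconsistent or empty vectors: {sorted(dims)}")
    return dims.pop()
```

For the file named in the command above, `glove_dimension("../vectors/glove.6B.200d.txt")` would be expected to agree with the size in the file name.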

Train

After preprocessing the files, run the training procedure:


python3 train.py -data data/gcn_exp -save_model data/tmp_ \
    -rnn_size 256 -word_vec_size 256 -layers 1 -epochs 10 \
    -optim adam -learning_rate 0.001 \
    -encoder_type gcn -gcn_num_inputs 256 -gcn_num_units 256 \
    -gcn_in_arcs -gcn_out_arcs -gcn_num_layers 1 -gcn_num_labels 5

To train with a GCN encoder, the following options must be set:

-encoder_type: Encoder to use; set to gcn
-gcn_num_inputs: Input size for the GCN layer
-gcn_num_units: Output size for the GCN layer
-gcn_num_labels: Number of labels for the edges of the GCN layer
-gcn_num_layers: Number of GCN layers
-gcn_in_arcs: Use incoming edges of the GCN layer
-gcn_out_arcs: Use outgoing edges of the GCN layer
-gcn_residual: Which skip connection to use between GCN layers, 'residual' or 'dense'; by default, no residual connections are used
-gcn_use_gates: Switch to activate edgewise gates
-gcn_use_glus: Node gates
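
To give a rough picture of what these options control, here is a simplified single GCN layer in plain Python. It is a sketch of the general technique, not the repository's implementation: gates, GLUs, and residual or dense connections are omitted, the label bias is shared across directions, and every name in it (`gcn_layer`, `W_in`, `W_out`, `W_self`, `b_label`) is an assumption made for illustration. The matrix shapes play the role of -gcn_num_inputs and -gcn_num_units, the length of `b_label` the role of -gcn_num_labels, and the two boolean flags the role of -gcn_in_arcs and -gcn_out_arcs.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def gcn_layer(h, edges, W_in, W_out, W_self, b_label, in_arcs=True, out_arcs=True):
    """One simplified GCN layer.

    h       : list of node vectors
    edges   : (source, target, label_id) triples for arcs source -> target
    b_label : one bias vector per edge label
    """
    # Every node starts from its own transformed representation (self-loop).
    out = [matvec(W_self, vec) for vec in h]
    for src, tgt, lab in edges:
        if in_arcs:
            # The target node receives a message along its incoming arc.
            out[tgt] = add(out[tgt], add(matvec(W_in, h[src]), b_label[lab]))
        if out_arcs:
            # The source node receives a message along its outgoing arc.
            out[src] = add(out[src], add(matvec(W_out, h[tgt]), b_label[lab]))
    # ReLU non-linearity.
    return [[max(0.0, x) for x in vec] for vec in out]


identity = [[1.0, 0.0], [0.0, 1.0]]
h = [[1.0, 0.0], [0.0, 1.0]]
edges = [(0, 1, 0)]    # a single arc: node 0 -> node 1, with label 0
biases = [[0.0, 0.0]]  # one edge label, zero bias

print(gcn_layer(h, edges, identity, identity, identity, biases))
print(gcn_layer(h, edges, identity, identity, identity, biases, out_arcs=False))
```

With outgoing arcs disabled, node 0 no longer receives information from node 1, which is the kind of difference the two arc options are meant to switch on and off.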

Add the following arguments to use pre-trained embeddings:

        -pre_word_vecs_enc data/gcn_exp.embeddings.enc.pt \
        -pre_word_vecs_dec data/gcn_exp.embeddings.dec.pt \

Generate

Generate with the trained model:

python3 translate.py -model data/tmp__acc_4.72_ppl_390.39_e1.pt -data_type gcn \
    -src data/webnlg/dev-webnlg-all-delex-src-nodes.txt \
    -tgt data/webnlg/dev-webnlg-all-delex-tgt.txt \
    -src_label data/webnlg/dev-webnlg-all-delex-src-labels.txt \
    -src_node1 data/webnlg/dev-webnlg-all-delex-src-node1.txt \
    -src_node2 data/webnlg/dev-webnlg-all-delex-src-node2.txt \
    -output data/webnlg/delexicalized_predictions_dev.txt -replace_unk -verbose
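
The checkpoint name in the command above appears to encode accuracy, perplexity, and epoch. Assuming every saved checkpoint follows the pattern of that one file name, the sketch below picks the checkpoint with the lowest perplexity. `parse_checkpoint` and `best_checkpoint` are illustrative helpers, not part of the repository.

```python
import re

# Pattern inferred from the single file name shown above (an assumption).
CKPT = re.compile(
    r"_acc_(?P<acc>\d+(?:\.\d+)?)_ppl_(?P<ppl>\d+(?:\.\d+)?)_e(?P<epoch>\d+)\.pt$"
)


def parse_checkpoint(name):
    """Return accuracy, perplexity and epoch parsed from a checkpoint name, or None."""
    match = CKPT.search(name)
    if match is None:
        return None
    return {
        "acc": float(match["acc"]),
        "ppl": float(match["ppl"]),
        "epoch": int(match["epoch"]),
    }


def best_checkpoint(names):
    """Return the name with the lowest perplexity, ignoring names that do not match."""
    scored = [(parse_checkpoint(name), name) for name in names]
    scored = [(info, name) for info, name in scored if info is not None]
    if not scored:
        return None
    return min(scored, key=lambda pair: pair[0]["ppl"])[1]
```

A typical call would pass the result of `glob.glob("data/tmp_*.pt")` to `best_checkpoint` and use the returned path as the -model argument.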

Postprocessing and Evaluation

For postprocessing, follow steps 2 and 3 of the WebNLG scripts. For evaluation, follow the instructions of the WebNLG challenge baseline or run webnlg_eval_scripts/calculate_bleu_dev.sh.

WebNLG scripts

  1. Generate input files for GCN (note: the WebNLG dataset partitions 'train' and 'dev' are in graph2text/webnlg-baseline/data/webnlg/)
cd data/webnlg/
python3 ../../webnlg_eval_scripts/webnlg_gcnonmt_input.py -i ./
python3 ../../webnlg_eval_scripts/webnlg_gcnonmt_input.py -i ./ -p test -c seen #to process test partition

(Make sure the test directory only contains files from the WebNLG dataset, e.g., look out for .DS_Store files.)

To have special arcs in the graph for multi-word named entities, add the -e argument. Otherwise, each such entity is a single node in the graph, e.g. The_Monument_To_War_Soldiers.

python3 ../../webnlg_eval_scripts/webnlg_gcnonmt_input.py -i ./ -e

To lowercase source and target tokens, add the -l argument. This applies only to the notdelex version.
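
The warning in this step about stray files such as .DS_Store in the test directory can be checked with a short script. The sketch assumes the WebNLG dataset files all carry an .xml extension, which is not stated here; adjust `allowed_ext` if that does not hold. `stray_files` is a hypothetical helper.

```python
import os


def stray_files(directory, allowed_ext=".xml"):
    """Return paths under `directory` whose extension is not `allowed_ext`."""
    stray = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if not name.lower().endswith(allowed_ext):
                stray.append(os.path.join(root, name))
    return sorted(stray)


if __name__ == "__main__":
    for path in stray_files("test"):
        print("unexpected file:", path)
```

Run from data/webnlg/, it prints any file in the test partition that does not look like a dataset file.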

  2. Relexicalise output of GCN
cd data/webnlg/
python3 ../../webnlg_eval_scripts/webnlg_gcnonmt_relexicalise.py -i ./ -f delexicalized_predictions_dev.txt

To relexicalise a specific partition only, e.g. test, add the following argument: -p test

Note: The scripts now read the file 'delex_dict.json' from the same directory as the main file (e.g. 'webnlg_gcnonmt_input.py').

Note: Sorting of the list of files has been added but is commented out.

python3 ../../webnlg_eval_scripts/webnlg_gcnonmt_relexicalise.py -i ./ -f delexicalized_predictions_test.txt -c seen
  3. Metrics (generate files for METEOR and TER)
python3 webnlg_eval_scripts/metrics.py --td data/webnlg/ --pred data/webnlg/rexicalized_predictions.txt --p dev

SR11 scripts

Generate/format input dataset for gcn encoder:

cd srtask/
python3 sr11_onmtgcn_input.py -i ../data/srtask11/SR_release1.0/ -t deep
python3 sr11_onmtgcn_input.py -i ../data/srtask11/SR_release1.0/ -t deep -p test

De-anonymise the output:

python3 sr_onmtgcn_deanonymise.py -i ../data/srtask11/SR_release1.0/ -f ../data/srtask11/SR_release1.0/devel-sr11-deep-anonym-tgt.txt -p devel -t deep

Generate/format input dataset for linearised input and sequence encoder:

cd srtask/
python3 sr11_linear_input.py -i ../data/srtask11/SR_release1.0/ -t deep
python3 sr11_linear_input.py -i ../data/srtask11/SR_release1.0/ -t deep -p test

Generate TER input files:

python3 srtask/srpredictions4ter.py --pred PREDSFILE --gold data/srtask11/SR_release1.0/test/SRTESTB_sents.txt

PREDSFILE is the name of the predictions file, given as a relative path.

License:MIT License

