tuetschek / e2e-cleaning

Cleaned E2E NLG Challenge data + supporting scripts

Cleaning Semantic Noise in the E2E dataset

An updated release of the E2E NLG Challenge data with cleaned MRs and scripts, accompanying the following paper:

Ondřej Dušek, David M. Howcroft, and Verena Rieser (2019): Semantic Noise Matters for Neural Natural Language Generation. In INLG, Tokyo, Japan.

Cleaned data

The fully cleaned E2E NLG Challenge data can be found in cleaned-data. The training and development sets are filtered so that they do not overlap with the test set, hence the no-ol naming.
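Each file follows the standard E2E CSV format, with one MR and one human reference per row. A minimal loading sketch (assuming pandas; the exact file name here is inferred from the no-ol convention above and may differ):

    import pandas as pd

    # Standard E2E CSV layout: 'mr' holds the meaning representation,
    # 'ref' the human reference text. The file name is assumed from the
    # "no-ol" naming convention described above.
    train = pd.read_csv('cleaned-data/train-fixed.no-ol.csv')
    print(train.columns.tolist())   # expected: ['mr', 'ref']
    print(train.iloc[0]['mr'])      # e.g. name[Alimentum], food[Italian], ...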

The partially cleaned data (see paper) are under partially-cleaned-data. Do not use these unless you have a good reason to do so.

Cleaning process

This section just documents what we did to obtain the cleaned data; you do not need to run these steps yourself.

1.) Re-annotate MRs in the data (use -t if you want a partial fix only):

    ./slot_error.py -f train-fixed.csv path/to/trainset.csv
    ./slot_error.py -f devel-fixed.csv path/to/devset.csv
    ./slot_error.py -f test-fixed.csv path/to/testset_w_refs.csv
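For intuition: the script matches slot values from the MR against the reference text using hand-written patterns. Below is a toy illustration of that idea only, not the actual slot_error.py implementation; the patterns shown are hypothetical examples:

    import re

    # Hypothetical surface patterns for a few slot values (the real
    # script's pattern set differs).
    PATTERNS = {
        ('food', 'Italian'): [r'\bitalian\b'],
        ('area', 'riverside'): [r'\briverside\b', r'\bby the river\b'],
        ('familyFriendly', 'yes'): [r'\bfamily[- ]friendly\b'],
    }

    def missing_slots(mr, text):
        """Return MR slots whose value is not realized in the text."""
        text = text.lower()
        missing = []
        for slot, value in re.findall(r'(\w+)\[(.*?)\]', mr):
            patterns = PATTERNS.get((slot, value), [re.escape(value.lower())])
            if not any(re.search(p, text) for p in patterns):
                missing.append((slot, value))
        return missing

    print(missing_slots('name[Alimentum], food[Italian], area[riverside]',
                        'Alimentum serves Italian food.'))
    # -> [('area', 'riverside')]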

2.) Remove instances with overlapping MRs (after re-annotation). This keeps the test set intact; if an instance overlaps between the training and development sets, it is removed from the training set:

    ./remove_overlaps.py train-fixed.csv devel-fixed.csv test-fixed.csv
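A rough sketch of what this step amounts to (assuming pandas; the actual remove_overlaps.py may differ, e.g., in how exactly overlap is defined):

    import pandas as pd

    # Load the re-annotated splits produced in step 1.
    train = pd.read_csv('train-fixed.csv')
    dev = pd.read_csv('devel-fixed.csv')
    test = pd.read_csv('test-fixed.csv')

    # The test set stays intact; dev drops MRs that occur in the test
    # set, and train drops MRs that occur in test or the filtered dev set.
    test_mrs = set(test['mr'])
    dev_clean = dev[~dev['mr'].isin(test_mrs)]
    train_clean = train[~train['mr'].isin(test_mrs | set(dev_clean['mr']))]

    # 'no-ol' = no overlaps (output file names assumed).
    train_clean.to_csv('train-fixed.no-ol.csv', index=False)
    dev_clean.to_csv('devel-fixed.no-ol.csv', index=False)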

Experiments with TGen

We used the data with TGen's default settings for the E2E Challenge, validating on the development set (additional training parameter -v input/devel-das.txt,input/devel-text.txt) and evaluating on the test set (both original and cleaned).
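For reference, a full training call then looks roughly like this (the seq2seq_train interface follows TGen's E2E Challenge usage, but this is only a sketch and the input/output file names are assumptions; check the TGen documentation for the exact invocation):

    ./run_tgen.py seq2seq_train config/config.yaml \
        input/train-das.txt input/train-text.txt model.pickle.gz \
        -v input/devel-das.txt,input/devel-text.txt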

To get the plain seq2seq configuration ("TGen-"), we set the classif_filter parameter in the config/config.yaml file to null. To use the slot error script as a reranker ("TGen+"), we set classif_filter in the following way:

    classif_filter: {'model': 'e2e_patterns'}
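Put side by side, the relevant config/config.yaml fragment for the two variants is (everything else stays at the E2E Challenge defaults):

    # TGen-: plain seq2seq, no reranker
    classif_filter: null

    # TGen+: pattern-based slot-error reranker
    classif_filter: {'model': 'e2e_patterns'}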

Note that a version of the slot_error.py script is included in the TGen code for simpler usage.

System outputs

You can find the outputs of all versions of TGen trained and tested on the original and cleaned data under system-outputs. These outputs were used to obtain the top halves of Tables 2 and 3 in the INLG paper.

There are 4 different systems included:

  • SC-LSTM (Wen et al., 2015)
  • TGen-minus – TGen without any reranker ("TGen-" above)
  • TGen-std – TGen with the standard LSTM reranker, trained on the same training data
  • TGen-plus – TGen with the rule-based pattern-matching reranker used to clean the data ("TGen+" above, i.e., an "oracle")

All systems were run 5 times with different random network initializations (run0-run4).
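If you want to process the released outputs programmatically, a minimal sketch for gathering the per-run files might look like this (the directory layout and file format under system-outputs are assumptions; adapt them to the actual structure):

    from pathlib import Path

    # Collect per-run output files; one generated sentence per line is
    # assumed, as is the run0-run4 naming in the paths.
    for run_file in sorted(Path('system-outputs').rglob('*run[0-4]*')):
        if run_file.is_file():
            lines = run_file.read_text(encoding='utf-8').splitlines()
            print(run_file, len(lines), 'outputs')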
