XingxingZhang / cnndm_acl18

Code to obtain the training data for the ACL 2018 paper "Neural Document Summarization by Jointly Learning to Score and Select Sentences"


Data processing for NeuSum

This repo contains the code that generates the training data (CNN / Daily Mail) needed by NeuSum.

  1. Preprocess CNN/DM dataset using abisee's scripts:

  2. Convert its output to the format shown in the sample_data folder. The file format is:

    1. File train.txt.src contains the input documents. Each line holds one document: its tokenized sentences, delimited by ##SENT##.
    2. File train.txt.tgt contains the summaries. Each line holds the summary of the corresponding document: its tokenized summary sentences, delimited by ##SENT##.
  3. Use the provided search script to find the best sentences to extract. The arguments of its main function are: document_file, summary_file and output_path.

  4. Next, build the ROUGE score gain file using the provided script. The usage is shown in the code entry.
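
To make the file format from step 2 concrete, here is a minimal sketch of reading a .src / .tgt pair. It assumes the two files are line-aligned and UTF-8 encoded; the helper names split_sentences and read_pairs are hypothetical and are not part of this repo.

```python
SENT_DELIM = "##SENT##"  # sentence delimiter described in step 2


def split_sentences(line):
    """Split one line of a .src or .tgt file into its tokenized sentences."""
    parts = line.rstrip("\n").split(SENT_DELIM)
    # Drop empty pieces so stray delimiters or blank lines yield no sentences.
    return [p.strip() for p in parts if p.strip()]


def read_pairs(src_path, tgt_path):
    """Yield (document_sentences, summary_sentences), one pair per line.

    Assumption: line i of the .tgt file is the summary of line i of the .src file.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            yield split_sentences(src_line), split_sentences(tgt_line)
```

Calling read_pairs("train.txt.src", "train.txt.tgt") would then iterate over the documents together with their summaries.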


The algorithm is a brute-force search, which can be slow in some cases. Therefore, running it in parallel is recommended (and it is what I did in my experiments).
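
For illustration, the sketch below shows one way a brute-force search over sentence subsets could be spread across processes. It is not the code in this repo: the scoring function is a simple unigram-overlap F1 used as a stand-in for ROUGE, and the names unigram_f1, best_subset, search_parallel, max_sentences and workers are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from multiprocessing import Pool


def unigram_f1(candidate_tokens, reference_tokens):
    """Stand-in score: unigram-overlap F1 (an assumption, not real ROUGE)."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def best_subset(example, max_sentences=3):
    """Try every subset of up to max_sentences sentences and keep the best one."""
    doc_sents, summary_sents = example
    reference = " ".join(summary_sents).split()
    best_ids, best_score = (), 0.0
    for k in range(1, max_sentences + 1):
        for ids in combinations(range(len(doc_sents)), k):
            candidate = " ".join(doc_sents[i] for i in ids).split()
            score = unigram_f1(candidate, reference)
            if score > best_score:  # ties keep the earlier, smaller subset
                best_ids, best_score = ids, score
    return best_ids, best_score


def search_parallel(examples, workers=4):
    """Documents are independent, so each one can be searched in its own task."""
    with Pool(workers) as pool:
        return pool.map(best_subset, examples, chunksize=16)
```

The number of subsets grows combinatorially with document length, which is why the search is slow on long documents. On platforms that start worker processes with spawn, call search_parallel from under an `if __name__ == "__main__":` guard.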



