lngvietthang / das

Document Abstractive Summarization

List of work items to implement in the preprocess.py script

lngvietthang opened this issue · comments

The preprocess.py script converts the raw CNN/DailyMail corpus into the standard format.

  • Convert raw corpus into the standard format

For example, if a story has N sentences in its content part and M sentences in its highlight part, the output is represented in the following format:

@content N
Sentence 1
Sentence 2
...
Sentence N
@highlight M
Highlight 1
Highlight 2
...
Highlight M
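
A minimal sketch of how a story could be serialized into this format. The function name `format_story` and its two list arguments are hypothetical and not taken from the repository; the sketch assumes the sentences have already been split and preprocessed.

```python
def format_story(content_sentences, highlight_sentences):
    """Serialize one story into the @content / @highlight layout.

    Both arguments are lists of already-preprocessed sentences
    (hypothetical interface, not the repository's actual one).
    """
    lines = [f"@content {len(content_sentences)}"]
    lines.extend(content_sentences)
    lines.append(f"@highlight {len(highlight_sentences)}")
    lines.extend(highlight_sentences)
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_story(["first sentence", "second sentence"], ["one highlight"]))
```

Writing the counts N and M in the header lines lets a reader of the file consume each part without scanning ahead for the next marker.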

Every sentence is processed by applying the following steps:

  • Word segmentation (separating words from punctuation)
  • Replace some abbreviations
  • Remove punctuation (', '', ``, [, (, {)
  • Remove stop-words
  • Stemming and lemmatization
  • Map some numbers to a date, phone number, or currency
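
The steps above could be sketched with the standard library alone, as below. Everything in this sketch is an assumption made for illustration: the order of the steps, the lowercasing, the tiny `ABBREVIATIONS` and `STOP_WORDS` tables, the `<currency>`, `<date>`, and `<phone>` placeholders, and the number formats they match. The `crude_stem` helper is only a stand-in suffix stripper; it is neither a real stemmer nor a lemmatizer, and a library implementation such as NLTK's `PorterStemmer` would be a more realistic choice.

```python
import re

# Hypothetical, deliberately tiny resources; a real script would load fuller lists.
ABBREVIATIONS = {"u.s.": "usa", "mr.": "mister", "dr.": "doctor"}
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "is", "was", "and"}
REMOVED_PUNCTUATION = {"'", "''", "``", "[", "(", "{"}

# Assumed surface forms for numbers that get mapped to a placeholder.
CURRENCY = r"\$\d+(?:\.\d+)?"      # e.g. $5 or $5.50
PHONE = r"\d{3}-\d{3}-\d{4}"       # e.g. 555-123-4567
DATE = r"\d{1,2}/\d{1,2}/\d{4}"    # e.g. 12/25/2014
NUMBER_PLACEHOLDERS = [(CURRENCY, "<currency>"), (PHONE, "<phone>"), (DATE, "<date>")]

# Word segmentation: known abbreviations and number forms stay whole,
# everything else is split into words and single punctuation marks.
_abbrev_alternatives = "|".join(
    re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True)
)
TOKEN_RE = re.compile(
    "|".join([_abbrev_alternatives, CURRENCY, PHONE, DATE, r"``|''", r"\w+", r"[^\w\s]"])
)


def crude_stem(token):
    """Strip a few common English suffixes (placeholder for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def map_number(token):
    """Return a placeholder if the whole token matches a known number form."""
    for pattern, placeholder in NUMBER_PLACEHOLDERS:
        if re.fullmatch(pattern, token):
            return placeholder
    return None


def preprocess_sentence(sentence):
    """Apply the listed steps to one sentence and return its tokens."""
    tokens = TOKEN_RE.findall(sentence.lower())   # lowercasing is an added assumption
    result = []
    for token in tokens:
        token = ABBREVIATIONS.get(token, token)   # replace abbreviations
        if token in REMOVED_PUNCTUATION or token in STOP_WORDS:
            continue                              # drop punctuation and stop-words
        placeholder = map_number(token)
        result.append(placeholder if placeholder else crude_stem(token))
    return result


if __name__ == "__main__":
    print(" ".join(preprocess_sentence("The dogs walked to Dr. Smith (call 555-123-4567).")))
```

Only the punctuation marks listed in the step are removed, so closing brackets and sentence-final periods survive in this sketch; whether that is intended is left open by the list above.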