List of tasks to implement in the Preprocess.py script.
lngvietthang opened this issue
Viet Thang Luong commented
preprocess.py
This script converts the raw CNN/DailyMail corpus into the standard format.
- Convert raw corpus into the standard format
For example, if a story has N sentences in its content part and M sentences in its highlight part, the output can be represented in the following format:
@content N
Sentence 1
Sentence 2
...
Sentence N
@highlight M
Highlight 1
Highlight 2
...
Highlight M
Here, all sentences are processed by applying the following steps:
- Word Segmentation (Separating words and punctuations)
- Replace some abbreviations
- Remove punctuation (', '', ``, [, (, {)
- Remove stop-words
- Stemming and Lemmatization
- Mapping some numbers to a date, phone number, or currency
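
A minimal sketch of how the steps and the output format above could fit together, using only the Python standard library. Everything named here is an assumption for illustration, not the script's actual design: the lookup tables `ABBREVIATIONS` and `STOP_WORDS`, the placeholder tokens `<date>`, `<phone>` and `<currency>`, the regular expressions, and the function names `preprocess_sentence` and `format_story`. Stemming and lemmatization are left as a pluggable `stem` callable (identity by default); something like NLTK's `PorterStemmer().stem` could be passed in instead.

```python
import re

# Hypothetical lookup tables; the issue does not specify the real ones.
ABBREVIATIONS = {"&": "and", "vs": "versus"}
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "is", "to"}

# Punctuation tokens listed in the issue: ', '', ``, [, (, {
DROP_TOKENS = {"'", "''", "``", "[", "(", "{"}

# Assumed number shapes and placeholder tokens, checked in order.
NUMBER_PATTERNS = [
    (re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}"), "<date>"),
    (re.compile(r"\d{3}-\d{3}-\d{4}"), "<phone>"),
    (re.compile(r"[$£€]\d+(?:[.,]\d+)*"), "<currency>"),
]

# Word segmentation: keep `` and '' together, keep number-like tokens
# (with an optional currency sign) together, split other punctuation off.
TOKEN_RE = re.compile(r"``|''|[$£€]?\d+(?:[/.,-]\d+)*|\w+|[^\w\s]")


def preprocess_sentence(sentence, stem=lambda word: word):
    """Apply the listed steps to one sentence; return space-joined tokens."""
    out = []
    for tok in TOKEN_RE.findall(sentence):
        if tok in DROP_TOKENS:
            continue  # remove the listed punctuation
        tok = ABBREVIATIONS.get(tok.lower(), tok)  # replace abbreviations
        if tok.lower() in STOP_WORDS:
            continue  # remove stop-words
        for pattern, placeholder in NUMBER_PATTERNS:
            if pattern.fullmatch(tok):
                tok = placeholder  # map number to date/phone/currency
                break
        else:
            tok = stem(tok)  # stemming / lemmatization hook
        out.append(tok)
    return " ".join(out)


def format_story(content, highlights):
    """Render processed sentences in the @content / @highlight layout."""
    lines = [
        f"@content {len(content)}",
        *content,
        f"@highlight {len(highlights)}",
        *highlights,
    ]
    return "\n".join(lines) + "\n"
```

With these assumed tables, `preprocess_sentence("The firm paid $1,200 on 12/05/2015.")` returns `firm paid <currency> <date> .`, and `format_story` writes the sentence counts N and M on the `@content` and `@highlight` lines. The order of the steps inside the loop is a design choice of this sketch; the issue only lists the steps and does not fix their order.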