Missing Data Preparation section for the CNN / DailyMail dataset

Question

cronopioelectronico opened this issue 3 years ago

cronopioelectronico commented 3 years ago

Hi,
The README file has instructions for preparing the other datasets, but none for the CNN / DailyMail dataset. Since you provide a checkpoint for this case, it would be great if you could include the data preparation instructions too.
Thanks.

Zhanghao Wu · Answer 1 · Wed Apr 21 2021 21:53:29 GMT+0800 (China Standard Time)

Thank you for asking! For convenience, we downloaded the CNN/DM dataset using tensorflow/tensor2tensor. Please try the commands below to prepare the binary dataset.

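#!/bin/bash

# Input text files are expected at
# $TEXT/cnndm.{train,dev,test}.$TRUNC.{source,target}
TEXT=data/cnn_daily_t2t
TRUNC=1000

# Binarize all three splits with one dictionary shared by source and target
fairseq-preprocess --source-lang source --target-lang target \
    --trainpref "$TEXT/cnndm.train.$TRUNC" \
    --validpref "$TEXT/cnndm.dev.$TRUNC" \
    --testpref "$TEXT/cnndm.test.$TRUNC" \
    --destdir "data/binary/cnndm_t2t_30k_$TRUNC" \
    --workers 20 --joined-dictionary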
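
Since fairseq-preprocess appends ".<lang>" to each prefix, the command above reads six text files: a .source and a .target file for each of train, dev, and test. A minimal pre-flight check along those lines might look like the sketch below. The function name check_cnndm_inputs is hypothetical and not part of fairseq or this repository; it only confirms that the files exist, not that their contents are valid.

```shell
#!/bin/bash
# Illustrative sketch: check_cnndm_inputs is a hypothetical helper,
# not part of fairseq. It verifies that the six text files read by
# fairseq-preprocess (given the prefixes above) are present.

check_cnndm_inputs() {
    local text_dir="$1"   # e.g. data/cnn_daily_t2t
    local trunc="$2"      # e.g. 1000
    local missing=0
    local split lang f
    for split in train dev test; do
        for lang in source target; do
            f="$text_dir/cnndm.$split.$trunc.$lang"
            if [ ! -f "$f" ]; then
                # Report every missing file instead of stopping at the first
                echo "missing: $f" >&2
                missing=1
            fi
        done
    done
    # Exit status 0 if all files exist, 1 otherwise
    return "$missing"
}
```

One way to use it is to call check_cnndm_inputs "$TEXT" "$TRUNC" before the fairseq-preprocess line and stop if it returns a non-zero status.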