Missing Data Preparation section for the CNN / DailyMail dataset
cronopioelectronico opened this issue · comments
Hi,
the README has instructions for preparing the other datasets, but not for the CNN / DailyMail dataset. Since you are providing the checkpoint for this case, it would be great if you could include the data preparation instructions too.
Thanks.
Thank you for asking! For convenience, we downloaded the CNN/DM dataset using tensorflow/tensor2tensor. Please try the commands below to prepare the binary dataset.
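#!/bin/bash
# Directory containing the CNN/DM text files.
TEXT=data/cnn_daily_t2t
# Suffix of the input file names, also appended to the output directory name.
TRUNC=1000
# Binarize the train/dev/test splits; --joined-dictionary builds one
# dictionary shared by the source and target sides.
fairseq-preprocess --source-lang source --target-lang target \
    --trainpref "$TEXT/cnndm.train.$TRUNC" \
    --validpref "$TEXT/cnndm.dev.$TRUNC" \
    --testpref "$TEXT/cnndm.test.$TRUNC" \
    --destdir "data/binary/cnndm_t2t_30k_$TRUNC" \
    --workers 20 --joined-dictionary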
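
A note for anyone following along: with `--source-lang source --target-lang target`, `fairseq-preprocess` reads each prefix with `.source` and `.target` appended, so the command above needs six input files. The snippet below is a minimal sketch for checking that they exist before running it. It is not part of the original instructions, and the function names `expected_files` and `check_inputs` are made up for illustration.

```bash
#!/bin/bash
# Sketch only: helper names are hypothetical, not part of fairseq.
# Defaults mirror the variables used in the preprocessing command.
TEXT=${TEXT:-data/cnn_daily_t2t}
TRUNC=${TRUNC:-1000}

# Print one expected input path per line: <prefix>.<lang>
# for each split (train, dev, test) and each side (source, target).
expected_files() {
    for split in train dev test; do
        for lang in source target; do
            echo "$TEXT/cnndm.$split.$TRUNC.$lang"
        done
    done
}

# Print any missing input file to stderr and return 1 if at least
# one is absent; return 0 when all six files are present.
check_inputs() {
    missing=$(expected_files | while IFS= read -r f; do
        [ -f "$f" ] || echo "$f"
    done)
    if [ -n "$missing" ]; then
        echo "missing input files:" >&2
        echo "$missing" >&2
        return 1
    fi
    return 0
}
```

Calling `check_inputs` before `fairseq-preprocess` should report a wrong `TEXT` or `TRUNC` value up front, since it checks for exactly the files that the command expects.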