mit-han-lab / lite-transformer

[ICLR 2020] Lite Transformer with Long-Short Range Attention

Home Page: https://arxiv.org/abs/2004.11886

Missing Data Preparation section for the CNN / DailyMail dataset

cronopioelectronico opened this issue

Hi,
The README has instructions for preparing the other datasets, but not for the CNN / DailyMail dataset. Since you provide a checkpoint for this dataset, it would be great if you could include the data preparation instructions too.
Thanks.

Thank you for asking! For convenience, we downloaded the CNN/DM dataset using TensorFlow's tensor2tensor. Please try the commands below to prepare the binary dataset.

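#!/bin/bash

# Directory holding the text files exported with tensor2tensor
TEXT=data/cnn_daily_t2t
# Suffix used in the input file names and in the output directory name
TRUNC=1000
# Binarize the train/dev/test splits with one dictionary shared by source and target
fairseq-preprocess --source-lang source --target-lang target \
    --trainpref "$TEXT/cnndm.train.$TRUNC" --validpref "$TEXT/cnndm.dev.$TRUNC" --testpref "$TEXT/cnndm.test.$TRUNC" \
    --destdir "data/binary/cnndm_t2t_30k_$TRUNC" \
    --workers 20 --joined-dictionary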
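
Before running the script, it may help to confirm that the input files are where fairseq-preprocess will look for them. fairseq-preprocess appends the language name to each prefix, so with --source-lang source and --target-lang target the command above reads six files: cnndm.{train,dev,test}.$TRUNC.source and the matching .target files. The helper below is only a sketch. The function name check_cnndm_inputs is hypothetical and not part of the repository, and it checks only that the files exist, not their contents.

```shell
#!/bin/bash

# Hypothetical helper (not part of lite-transformer or fairseq).
# Usage: check_cnndm_inputs <text_dir> <trunc>
# Prints every missing input file to stderr and returns 1 if any is missing.
check_cnndm_inputs() {
    local text_dir="$1" trunc="$2" missing=0
    local split lang path
    for split in train dev test; do
        for lang in source target; do
            # Same naming scheme as the --trainpref/--validpref/--testpref prefixes
            path="$text_dir/cnndm.$split.$trunc.$lang"
            if [ ! -f "$path" ]; then
                echo "missing: $path" >&2
                missing=1
            fi
        done
    done
    return "$missing"
}
```

For example, check_cnndm_inputs data/cnn_daily_t2t 1000 returns 0 only when all six files are present, so it can gate the fairseq-preprocess call.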