Tokenization process in LREC version?
simtony opened this issue · comments
Tony commented
The test set and training set are pre-tokenized, and no description of the tokenization process is provided.
Tokenization affects both the performance of off-the-shelf parsers and BLEU computation.
For rigorous research, it would be helpful to supply the tokenization script, a detokenization script, or an untokenized version of the training and test sets.
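To illustrate why the tokenization scheme matters for BLEU, here is a minimal sketch (plain Python, hypothetical sentence pair) showing that unigram precision against a pre-tokenized reference changes depending on whether the hypothesis is tokenized the same way:

```python
from collections import Counter

def unigram_precision(hyp, ref):
    """Fraction of hypothesis tokens matched in the reference (clipped counts)."""
    hyp_counts, ref_counts = Counter(hyp), Counter(ref)
    matches = sum(min(c, ref_counts[t]) for t, c in hyp_counts.items())
    return matches / max(len(hyp), 1)

ref = "Hello , world !".split()   # pre-tokenized reference: punctuation split off

# Raw whitespace split: punctuation stays attached, so no tokens match
p_raw = unigram_precision("Hello, world!".split(), ref)

# Hypothesis tokenized the same way as the reference: every token matches
p_tok = unigram_precision("Hello , world !".split(), ref)

print(p_raw)  # 0.0
print(p_tok)  # 1.0
```

The same effect applies to the higher-order n-grams in full BLEU, which is why scores computed with mismatched tokenization are not comparable.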
raganato commented
To detokenize the data, you can use the detokenizer script from the Moses project.
Here is the link:
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl
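For reference, a typical invocation looks like this (the file names are placeholders; `-l` selects the language rules used for reattaching punctuation):

```
perl detokenizer.perl -l en < input.tok.en > output.en
```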