Helsinki-NLP / MuCoW

Automatically harvested multilingual contrastive word sense disambiguation test sets for machine translation


Tokenization process in LREC version?

simtony opened this issue

The test and training sets are pre-tokenized, and no description of the tokenization process is provided.
Tokenization affects both the performance of off-the-shelf parsers and BLEU computation.
For rigorous research, it would be helpful to supply the tokenization script, a detokenizer, or an untokenized version of the training and test sets.

To detokenize the data, you can use the detokenizer script from the Moses project.
Here is the link:
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl
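
For reference, a minimal invocation sketch of that script. The file names (`test.tok.en`, `test.detok.en`) and the language code `en` are placeholders for illustration, not part of the MuCoW release:

```sh
# Fetch the Moses detokenizer script.
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/detokenizer.perl

# Detokenize one side of the data; -l selects language-specific rules
# (e.g. "en" for English). Reads stdin, writes stdout.
perl detokenizer.perl -l en < test.tok.en > test.detok.en
```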