Helsinki-NLP / MuCoW

Automatically harvested multilingual contrastive word sense disambiguation test sets for machine translation


Tokenization process in LREC version?

simtony opened this issue

The test and training sets are pre-tokenized, and no description of the tokenization process is provided.
Tokenization affects both the performance of off-the-shelf parsers and BLEU computation.
For rigorous research, it would be helpful to supply the tokenization script, a detokenizer, or an untokenized version of the training and test sets.

To detokenize the data, you can use the detokenizer script from the Moses project.
Here is the link:
https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl
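
For reference, a minimal invocation sketch of that script. The file names (`test.tok.en`, `test.detok.en`) and the language code `en` are placeholders for illustration, not part of the MuCoW release:

```sh
# Fetch the Moses detokenizer script.
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/detokenizer.perl

# Detokenize one side of the data; -l selects language-specific rules
# (e.g. "en" for English). Reads stdin, writes stdout.
perl detokenizer.perl -l en < test.tok.en > test.detok.en
```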