200k Sinhala parallel sentences are filtered
xiamengzhou opened this issue
We attempted to reproduce the results for the Sinhala-English pair. Using the data processing and training scripts provided in the repo, we found that 1) about 200k parallel sentences are filtered out with TRAIN_MINLEN=6; 2) the BLEU score is 6.56, around 1.2 lower than the one claimed in the paper. Is this the correct way to reproduce the result? Is it possible that we shouldn't filter sentences by minimum length?
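For context, a minimal sketch of the kind of min-length filter being discussed, assuming it behaves like a standard joint length filter over tokenized sentence pairs (this is a hypothetical reimplementation for illustration, not the repo's actual preprocessing script):

```python
# Hypothetical sketch of min-length filtering on a parallel corpus.
# A pair is dropped if either side has fewer than TRAIN_MINLEN tokens,
# which is how ~200k of the 761k pairs could end up removed.

TRAIN_MINLEN = 6  # value mentioned in the issue


def filter_parallel(src_lines, tgt_lines, min_len=TRAIN_MINLEN):
    """Keep only pairs where both sides have at least min_len tokens."""
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        if len(src.split()) >= min_len and len(tgt.split()) >= min_len:
            kept.append((src, tgt))
    return kept


# Example: the second pair is dropped because its target side is too short.
pairs = filter_parallel(
    ["one two three four five six", "a b c d e f"],
    ["token token token token token token", "too short"],
)
print(len(pairs))  # 1
```

Note that if the filter runs after BPE, segmentation changes token counts, so the set of surviving pairs can differ from a pre-BPE filter with the same threshold.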
Hi Mengzhou, you are referring to the fact that wiki_si_en_bpe5000/train.en has 761K sentences, while wiki_si_en_bpe5000/train.bpe.en only has 579K, right? I used the same Transformer architecture as the authors (although not based on fairseq) and got similar si --> en BLEU as in the paper. So I'd guess that filtering out those 200K sentences is not the reason for the BLEU difference you encountered. More likely, the default hyperparameters provided in the training script were not the ones that produced the 7.2 BLEU in the paper.
Danni