200k Sinhala parallel sentences are filtered
xiamengzhou opened this issue
We attempted to reproduce the results for the Sinhala-English pair. Using the data processing and training scripts provided in the repo, we found that 1) about 200k parallel sentences are filtered out with TRAIN_MINLEN=6; 2) the BLEU score is 6.56, around 1.2 lower than the one claimed in the paper. Is this the correct way to reproduce the result? Is it possible that we shouldn't filter sentences by minimum length?
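For context, a minimal sketch of the kind of min-length filter being discussed, assuming it behaves like a standard joint length filter over tokenized sentence pairs (this is a hypothetical reimplementation for illustration, not the repo's actual preprocessing script):

```python
# Hypothetical sketch of min-length filtering on a parallel corpus.
# A pair is dropped if either side has fewer than TRAIN_MINLEN tokens,
# which is how ~200k of the 761k pairs could end up removed.

TRAIN_MINLEN = 6  # value mentioned in the issue


def filter_parallel(src_lines, tgt_lines, min_len=TRAIN_MINLEN):
    """Keep only pairs where both sides have at least min_len tokens."""
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        if len(src.split()) >= min_len and len(tgt.split()) >= min_len:
            kept.append((src, tgt))
    return kept


# Example: the second pair is dropped because its target side is too short.
pairs = filter_parallel(
    ["one two three four five six", "a b c d e f"],
    ["token token token token token token", "too short"],
)
print(len(pairs))  # 1
```

Note that if the filter runs after BPE, segmentation changes token counts, so the set of surviving pairs can differ from a pre-BPE filter with the same threshold.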
Hi Mengzhou, you are referring to the fact that wiki_si_en_bpe5000/train.en has 761K sentences, while wiki_si_en_bpe5000/train.bpe.en only has 579K, right? I used the same Transformer architecture as the authors (although not based on fairseq) and got similar si --> en BLEU as in the paper. So I'd guess that filtering out those 200K sentences is not the reason for the BLEU difference you encountered. More likely, the default hyperparameters provided in the training script were not the ones that produced the 7.2 BLEU in the paper.
Danni