facebookresearch / flores

Facebook Low Resource (FLoRes) MT Benchmark

How to replicate supervised NE-EN baseline?

j0ma opened this issue

Hi there,

I'm currently trying to reproduce the baseline supervised results from the README but have so far not been able to do so.

Using the following command

fairseq-generate \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang ne --target-lang en \
    --path $CHECKPOINT_DIR \
    --beam 5 --lenpen 1.2 \
    --gen-subset valid \
    --remove-bpe=sentencepiece \
    --sacrebleu

I get the following results:

| Translated 2559 sentences (79390 tokens) in 46.7s (54.85 sentences/s, 1701.52 tokens/s)                                                                                     
| Generate valid with beam=5: BLEU = 6.09 38.4/10.1/3.6/1.4 (BP = 0.917 ratio = 0.920 hyp_len = 42313 ref_len = 45975)

Changing --gen-subset valid to --gen-subset test yields:

| Translated 2835 sentences (94317 tokens) in 57.7s (49.17 sentences/s, 1635.67 tokens/s)
| Generate test with beam=5: BLEU = 7.66 40.2/12.0/4.5/1.9 (BP = 0.958 ratio = 0.959 hyp_len = 48970 ref_len = 51076)

These BLEU scores of 6.09 and 7.66 seem to differ from those reported in Table 3 of the paper, which lists a BLEU score of 7.6 for devtest.

Training is performed with

CUDA_VISIBLE_DEVICES=0 fairseq-train \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang ne --target-lang en \
    --arch transformer --share-all-embeddings \
    --encoder-layers 5 --decoder-layers 5 \
    --encoder-embed-dim 512 --decoder-embed-dim 512 \
    --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
    --encoder-attention-heads 2 --decoder-attention-heads 2 \
    --encoder-normalize-before --decoder-normalize-before \
    --dropout 0.4 --attention-dropout 0.2 --relu-dropout 0.2 \
    --weight-decay 0.0001 \
    --label-smoothing 0.2 --criterion label_smoothed_cross_entropy \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-7 \
    --lr 1e-3 --min-lr 1e-9 \
    --max-tokens 4000 \
    --update-freq 4 \
    --max-epoch 100 --save-interval 1 --save-dir $CHECKPOINT_DIR

This command should be the same as the one in the README, except for the checkpoint directory.

My current specs:

# Azure VM:
> Standard NC6_Promo (6 vcpus, 56 GiB memory)

# GPU information:
> sudo lspci -k | grep "NVIDIA"
< d26a:00:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
<	Subsystem: NVIDIA Corporation GK210GL [Tesla K80]

# Fairseq version: 
> pip freeze | grep "fairseq"
< fairseq==0.9.0

I realize some of this may be due to hardware differences, but beyond that I am wondering:

  1. Comparing the sentence counts in my evaluation outputs to Table 1 of the paper, it seems like using --gen-subset valid corresponds to using dev (2559 sentences) and --gen-subset test to using devtest (2835 sentences). Is this correct?

    • According to README.md, --gen-subset test should use the test data, which, according to Table 1, contains 2924 sentences -- a number that matches neither of my outputs.
  2. What is the version of fairseq the baseline results in the paper were created with?

  3. Could some of this have to do with the random initialization of weights/embeddings? Is there a seed I can set to better control this? If so, was a specific seed used to produce the results in the paper?

  4. The closest I get to Table 3 performance is with --gen-subset test, which gives me BLEU 7.66. Since this is only 0.06 away from the reported devtest value, could it effectively be a rounding error?

Apologies for a long post and thank you very much in advance!

Thanks so much for the report and the detailed information. It really helps us understand the reproducibility problem better.

Comparing the sentence counts in my evaluation outputs to Table 1 of the paper, it seems like using --gen-subset valid corresponds to using dev (2559 sentences) and --gen-subset test to using devtest (2835 sentences). Is this correct?

Yes, this is correct. Our preparation script does not put the FLoRes test set in the preprocessed data-bin. I will update the text in the README to align with the terminology used in the paper. To clarify, --gen-subset test generates the number that can be compared with the one reported in the paper (which is on the FLoRes devtest set).
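If it helps, here is a minimal sketch of how to double-check which split is being scored: dump the generation output, re-sort the hypotheses by sentence id, count them, and score against the raw devtest reference with sacrebleu. The reference path below is a placeholder for wherever your raw FLoRes devtest English file lives, and the handling of the H lines assumes the fairseq 0.9-style output, where --remove-bpe has already been applied to them.

# Dump the generation output to a file
fairseq-generate \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang ne --target-lang en \
    --path $CHECKPOINT_DIR \
    --beam 5 --lenpen 1.2 \
    --gen-subset test \
    --remove-bpe=sentencepiece > gen.test.out

# Re-sort hypotheses by sentence id (fairseq orders its output by batch, not corpus order)
grep ^H gen.test.out | sed 's/^H-//' | sort -n | cut -f3- > hyp.test.en

# Sentence count: 2835 should correspond to devtest per Table 1
wc -l hyp.test.en

# Score the detokenized hypotheses against the raw devtest reference (placeholder path)
sacrebleu /path/to/devtest.ne-en.en < hyp.test.en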

What is the version of fairseq the baseline results in the paper were created with?

It was based on fairseq 0.7.2.
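If you want to match the paper's setup more closely, pinning that release should work (assuming this refers to the 0.7.2 release on PyPI):

# Pin the fairseq release assumed to have been used for the paper's baselines
pip install fairseq==0.7.2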

Could some of this have to do with random initialization of weights/embeddings? Is there a seed I can set to better control this ? If so, was a certain seed used to create the results of the paper?

We handle the seed carefully in our released pipeline, covering both data preprocessing and training (including fairseq itself). However, we've observed that even when training with the same data and the same number of GPUs, runs in different environments sometimes do not reproduce the exact score. In multi-GPU setups, the BLEU score can differ by up to 0.5 points in the worst case (in most cases the difference is 0.1~0.3 points).
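As a side note, a sketch of how you can pin the training seed on your end: fairseq-train exposes a --seed flag (the default is 1), so it can simply be added to the training command from your post. On a single GPU this removes one source of variance, though as noted above it will not guarantee a bit-exact match across environments.

# Same training command as in the original post, with an explicit seed
CUDA_VISIBLE_DEVICES=0 fairseq-train \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang ne --target-lang en \
    --seed 1 \
    --arch transformer --share-all-embeddings \
    --encoder-layers 5 --decoder-layers 5 \
    --encoder-embed-dim 512 --decoder-embed-dim 512 \
    --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
    --encoder-attention-heads 2 --decoder-attention-heads 2 \
    --encoder-normalize-before --decoder-normalize-before \
    --dropout 0.4 --attention-dropout 0.2 --relu-dropout 0.2 \
    --weight-decay 0.0001 \
    --label-smoothing 0.2 --criterion label_smoothed_cross_entropy \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-7 \
    --lr 1e-3 --min-lr 1e-9 \
    --max-tokens 4000 \
    --update-freq 4 \
    --max-epoch 100 --save-interval 1 --save-dir $CHECKPOINT_DIR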

The closest I get to Table 3 performance is with --gen-subset test, which gives me BLEU 7.66. Since this is only 0.06 away from the reported devtest value, could it effectively be a rounding error?

Yes, we rounded the numbers in the paper.

The README has been updated. I am closing this, but feel free to reopen if there are any issues.