Reproducing Fully Supervised Baseline

Question

aw31 opened this issue 5 years ago

[Following up on email.]

We attempted to reproduce the fully supervised baseline for Ne->En translation with the commands given in the README. However, we were unable to obtain the BLEU score reported in the paper: our model converges to ~5.6 BLEU (at ~24 ppl), whereas the paper reports 7.6 BLEU.

Some more details about the training: We trained with the provided code and evaluated using sacrebleu. We first observed this while training on a single K80 GPU. Since our first email, we also reran the model on 4 K80 GPUs (with batch size between 50k and 100k) and still got the same results. (In particular, --fp16 is not available on this hardware.) We are wondering if you are able to reproduce these observations or have any suggestions for resolving this discrepancy.

Thanks!

Alexander Wei · Answer 1 · Mon Apr 08 2019 14:42:16 GMT+0800 (China Standard Time)

If it is helpful, here is a snippet of what the training log looks like around convergence:

| epoch 081 | loss 4.713 | nll_loss 2.123 | ppl 4.36 | wps 18438 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 10092 | lr 0.000629566 | gnorm 0.101 | clip 0.000 | oom 0.000 | wall 4473 | train_wall 33290
| epoch 081 | valid on 'valid' subset | loss 6.846 | nll_loss 4.610 | ppl 24.41 | num_updates 10092 | best_loss 6.84088
| epoch 082 | loss 4.711 | nll_loss 2.119 | ppl 4.34 | wps 18465 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 10164 | lr 0.000627332 | gnorm 0.100 | clip 0.000 | oom 0.000 | wall 4875 | train_wall 33677
| epoch 082 | valid on 'valid' subset | loss 6.860 | nll_loss 4.625 | ppl 24.68 | num_updates 10164 | best_loss 6.84088
| epoch 083 | loss 4.710 | nll_loss 2.118 | ppl 4.34 | wps 18286 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 10236 | lr 0.000625122 | gnorm 0.102 | clip 0.000 | oom 0.000 | wall 5282 | train_wall 34068
| epoch 083 | valid on 'valid' subset | loss 6.868 | nll_loss 4.629 | ppl 24.75 | num_updates 10236 | best_loss 6.84088
| epoch 084 | loss 4.704 | nll_loss 2.111 | ppl 4.32 | wps 18453 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 10308 | lr 0.000622935 | gnorm 0.106 | clip 0.000 | oom 0.000 | wall 5684 | train_wall 34454
| epoch 084 | valid on 'valid' subset | loss 6.854 | nll_loss 4.619 | ppl 24.58 | num_updates 10308 | best_loss 6.84088
| epoch 085 | loss 4.704 | nll_loss 2.111 | ppl 4.32 | wps 18266 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 10380 | lr 0.000620771 | gnorm 0.101 | clip 0.000 | oom 0.000 | wall 6091 | train_wall 34846
| epoch 085 | valid on 'valid' subset | loss 6.836 | nll_loss 4.597 | ppl 24.20 | num_updates 10380 | best_loss 6.8364
| epoch 086 | loss 4.700 | nll_loss 2.106 | ppl 4.31 | wps 18374 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 10452 | lr 0.000618629 | gnorm 0.102 | clip 0.000 | oom 0.000 | wall 6495 | train_wall 35235
| epoch 086 | valid on 'valid' subset | loss 6.856 | nll_loss 4.626 | ppl 24.69 | num_updates 10452 | best_loss 6.84088
| epoch 087 | loss 4.696 | nll_loss 2.101 | ppl 4.29 | wps 18439 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 10524 | lr 0.000616509 | gnorm 0.103 | clip 0.000 | oom 0.000 | wall 6898 | train_wall 35623
| epoch 087 | valid on 'valid' subset | loss 6.865 | nll_loss 4.627 | ppl 24.71 | num_updates 10524 | best_loss 6.84088
| epoch 088 | loss 4.695 | nll_loss 2.100 | ppl 4.29 | wps 18487 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 10596 | lr 0.000614411 | gnorm 0.101 | clip 0.000 | oom 0.000 | wall 7299 | train_wall 36009
| epoch 088 | valid on 'valid' subset | loss 6.854 | nll_loss 4.628 | ppl 24.73 | num_updates 10596 | best_loss 6.84088
| epoch 089 | loss 4.692 | nll_loss 2.096 | ppl 4.28 | wps 18385 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 10668 | lr 0.000612334 | gnorm 0.102 | clip 0.000 | oom 0.000 | wall 7707 | train_wall 36398
| epoch 089 | valid on 'valid' subset | loss 6.855 | nll_loss 4.629 | ppl 24.74 | num_updates 10668 | best_loss 6.84088
| epoch 090 | loss 4.690 | nll_loss 2.094 | ppl 4.27 | wps 18428 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 10740 | lr 0.000610278 | gnorm 0.100 | clip 0.000 | oom 0.000 | wall 8109 | train_wall 36786
| epoch 090 | valid on 'valid' subset | loss 6.878 | nll_loss 4.634 | ppl 24.82 | num_updates 10740 | best_loss 6.84088
| epoch 091 | loss 4.687 | nll_loss 2.089 | ppl 4.26 | wps 18291 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 10812 | lr 0.000608243 | gnorm 0.100 | clip 0.000 | oom 0.000 | wall 8517 | train_wall 37177
| epoch 091 | valid on 'valid' subset | loss 6.832 | nll_loss 4.599 | ppl 24.23 | num_updates 10812 | best_loss 6.83185
| epoch 092 | loss 4.684 | nll_loss 2.086 | ppl 4.25 | wps 18580 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 10884 | lr 0.000606228 | gnorm 0.103 | clip 0.000 | oom 0.000 | wall 8916 | train_wall 37563
| epoch 092 | valid on 'valid' subset | loss 6.837 | nll_loss 4.597 | ppl 24.19 | num_updates 10884 | best_loss 6.83674
| epoch 093 | loss 4.682 | nll_loss 2.084 | ppl 4.24 | wps 18487 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 10956 | lr 0.000604232 | gnorm 0.103 | clip 0.000 | oom 0.000 | wall 9326 | train_wall 37949
| epoch 093 | valid on 'valid' subset | loss 6.847 | nll_loss 4.617 | ppl 24.54 | num_updates 10956 | best_loss 6.83674
| epoch 094 | loss 4.681 | nll_loss 2.083 | ppl 4.24 | wps 18258 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 11028 | lr 0.000602257 | gnorm 0.105 | clip 0.000 | oom 0.000 | wall 9732 | train_wall 38340
| epoch 094 | valid on 'valid' subset | loss 6.837 | nll_loss 4.600 | ppl 24.26 | num_updates 11028 | best_loss 6.83652
| epoch 095 | loss 4.680 | nll_loss 2.082 | ppl 4.23 | wps 18369 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 11100 | lr 0.0006003 | gnorm 0.104 | clip 0.000 | oom 0.000 | wall 10149 | train_wall 38730
| epoch 095 | valid on 'valid' subset | loss 6.862 | nll_loss 4.622 | ppl 24.62 | num_updates 11100 | best_loss 6.83652
| epoch 096 | loss 4.674 | nll_loss 2.073 | ppl 4.21 | wps 18453 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 11172 | lr 0.000598363 | gnorm 0.101 | clip 0.000 | oom 0.000 | wall 10551 | train_wall 39117
| epoch 096 | valid on 'valid' subset | loss 6.822 | nll_loss 4.586 | ppl 24.01 | num_updates 11172 | best_loss 6.82161
| epoch 097 | loss 4.674 | nll_loss 2.074 | ppl 4.21 | wps 18376 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 11244 | lr 0.000596444 | gnorm 0.102 | clip 0.000 | oom 0.000 | wall 10965 | train_wall 39506
| epoch 097 | valid on 'valid' subset | loss 6.867 | nll_loss 4.629 | ppl 24.75 | num_updates 11244 | best_loss 6.82161
| epoch 098 | loss 4.670 | nll_loss 2.069 | ppl 4.19 | wps 18379 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 11316 | lr 0.000594543 | gnorm 0.101 | clip 0.000 | oom 0.000 | wall 11369 | train_wall 39895
| epoch 098 | valid on 'valid' subset | loss 6.851 | nll_loss 4.605 | ppl 24.34 | num_updates 11316 | best_loss 6.82161
| epoch 099 | loss 4.668 | nll_loss 2.066 | ppl 4.19 | wps 18459 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 11388 | lr 0.000592661 | gnorm 0.103 | clip 0.000 | oom 0.000 | wall 11772 | train_wall 40282
| epoch 099 | valid on 'valid' subset | loss 6.832 | nll_loss 4.600 | ppl 24.26 | num_updates 11388 | best_loss 6.82161
| epoch 100 | loss 4.665 | nll_loss 2.063 | ppl 4.18 | wps 18403 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 11460 | lr 0.000590796 | gnorm 0.099 | clip 0.000 | oom 0.000 | wall 12175 | train_wall 40669
| epoch 100 | valid on 'valid' subset | loss 6.844 | nll_loss 4.607 | ppl 24.36 | num_updates 11460 | best_loss 6.82161
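
As a sanity check on the log above, the ppl column appears to equal 2 raised to nll_loss (the loss is reported in base 2). The following is a minimal Python sketch, assuming the two epoch 081 lines are representative; `parse_line` is a hypothetical helper written for this illustration, not part of fairseq:

```python
import re

# One training line and one validation line copied from the log above.
TRAIN_LINE = "| epoch 081 | loss 4.713 | nll_loss 2.123 | ppl 4.36 | wps 18438 | ups 0 | wpb 102175.181 | bsz 7820.583 | num_updates 10092 | lr 0.000629566 | gnorm 0.101 | clip 0.000 | oom 0.000 | wall 4473 | train_wall 33290"
VALID_LINE = "| epoch 081 | valid on 'valid' subset | loss 6.846 | nll_loss 4.610 | ppl 24.41 | num_updates 10092 | best_loss 6.84088"

# A cell looks like "<name> <number>"; other cells (e.g. "valid on 'valid' subset") are skipped.
CELL = re.compile(r"(\w+) (\d+(?:\.\d+)?)")


def parse_line(line):
    """Return the numeric fields of one '|'-separated log line as a dict."""
    fields = {}
    for cell in line.split("|"):
        match = CELL.fullmatch(cell.strip())
        if match:
            fields[match.group(1)] = float(match.group(2))
    return fields


train = parse_line(TRAIN_LINE)
valid = parse_line(VALID_LINE)

for name, row in (("train", train), ("valid", valid)):
    # Recompute perplexity from the base-2 negative log-likelihood.
    recomputed = 2 ** row["nll_loss"]
    print(f"{name}: logged ppl={row['ppl']}, 2**nll_loss={recomputed:.2f}")
```

The recomputed values agree with the logged ones up to the rounding of nll_loss. The same parsed fields also show the gap between training perplexity (about 4.3) and validation perplexity (about 24).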

Peng-Jen Chen · Answer 2 · Mon Apr 15 2019 22:18:03 GMT+0800 (China Standard Time)

@aw31, sorry for the late reply. I need more information to reproduce this on my side.
Could you provide the fairseq parameters you used? The first line of fairseq's training log output contains the full list of parameters (starting with Namespace...).
How did you prepare the training data? Did you use the provided scripts (download-data.sh, prepare-neen.sh), or did you process the data differently?

Alexander Wei · Answer 3 · Tue Apr 16 2019 14:51:41 GMT+0800 (China Standard Time)

Thanks for your comment. Below is a paste of the full list of parameters. For preparation of data, we used the provided scripts.

Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer', attention_dropout=0.2, bucket_cap_mb=25, clip_norm=0.0, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data=['data-bin/wiki_ne_en_bpe5000/'], ddp_backend='no_c10d', decoder_attention_heads=2, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layers=5, decoder_learned_pos=False, decoder_normalize_before=True, decoder_output_dim=512, device_id=2, distributed_backend='nccl', distributed_init_method='tcp://localhost:19790', distributed_port=-1, distributed_rank=2, distributed_world_size=4, dropout=0.4, encoder_attention_heads=2, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=2048, encoder_layers=5, encoder_learned_pos=False, encoder_normalize_before=True, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.2, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=150, lr=[0.001], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=150, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4000, max_update=0, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=True, no_save=False, no_token_positional_embeddings=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', raw_text=False, relu_dropout=0.2, required_batch_size_multiple=8, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints', save_interval=2, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, 
source_lang='ne', target_lang='en', task='translation', tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[8], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0001)
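
The lr values in the earlier training log are consistent with the settings above (lr=[0.001], lr_scheduler='inverse_sqrt', warmup_updates=4000, warmup_init_lr=1e-07). The following is a minimal sketch of such a schedule, assuming linear warmup followed by decay proportional to the inverse square root of the update number; `inverse_sqrt_lr` is an illustrative function written for this thread, not fairseq's own code:

```python
import math


def inverse_sqrt_lr(num_updates, peak_lr=0.001, warmup_updates=4000, warmup_init_lr=1e-07):
    """Learning rate after `num_updates` optimizer steps (illustrative sketch)."""
    if num_updates < warmup_updates:
        # Linear warmup from warmup_init_lr to peak_lr.
        step = (peak_lr - warmup_init_lr) / warmup_updates
        return warmup_init_lr + num_updates * step
    # After warmup: decay as 1/sqrt(num_updates), continuous at warmup_updates.
    return peak_lr * math.sqrt(warmup_updates / num_updates)


# Compare against two (num_updates, lr) pairs taken from the training log.
for num_updates, logged_lr in ((10092, 0.000629566), (11100, 0.0006003)):
    print(num_updates, logged_lr, f"{inverse_sqrt_lr(num_updates):.9f}")
```

Under these assumptions, the computed values match the logged learning rates at epochs 081 and 095 to the printed precision.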

Peng-Jen Chen · Answer 4 · Tue Apr 23 2019 13:18:45 GMT+0800 (China Standard Time)

Hi @aw31, I used exactly the same parameters as you provided above to reproduce this on my side. I got a validation BLEU of 5.19 and a test BLEU of 6.78. Here are the parameters I used, along with the first and last few epochs of the training log:
https://gist.github.com/pipibjc/5014656d9bee25d2beb78b296fd5849b

When you reported ~5.6 BLEU, was that validation BLEU or test BLEU? The 7.6 BLEU we reported is the BLEU score on the test set.

As for the gap between the 6.78 test BLEU and the reported 7.6 test BLEU: at the moment I can only get 7.3 test BLEU from the repository. We used a different script to prepare the data when we reported the 7.6 test BLEU, and that accounts for the remaining difference.

Getting from 6.78 to 7.3 test BLEU takes two changes:

1. Set update_freq to 1 if training with 4 GPUs (or update_freq to 4 for 1 GPU). This brings test BLEU from 6.78 to 6.88.
2. Pull the latest changes to the repository and rerun the data preparation scripts (the download step can be skipped). We just found and fixed a bug in our ne-en data preparation. This brings test BLEU from 6.88 to 7.30. Here is the training log that reproduces the 7.30 test BLEU score:
https://gist.github.com/pipibjc/ea04f9a3ac3641c0ca1b619cc365e929
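
To make the update_freq change concrete, the arithmetic can be sketched as follows. This assumes max_tokens is a per-GPU upper bound on tokens per forward/backward pass and that it stays at 4000 as in the Namespace posted earlier (the linked gist may use a different value); `max_tokens_per_update` is a hypothetical helper, not a fairseq function:

```python
def max_tokens_per_update(max_tokens, num_gpus, update_freq):
    """Upper bound on tokens contributing to one optimizer step.

    Each GPU processes up to `max_tokens` tokens per pass, and gradients are
    accumulated over `update_freq` passes before the parameters are updated.
    """
    return max_tokens * num_gpus * update_freq


# (label, number of GPUs, update_freq); max_tokens=4000 is assumed throughout.
configs = (
    ("original run: 4 GPUs, update_freq=8", 4, 8),
    ("suggested: 4 GPUs, update_freq=1", 4, 1),
    ("suggested: 1 GPU, update_freq=4", 1, 4),
)

for label, num_gpus, update_freq in configs:
    print(label, max_tokens_per_update(4000, num_gpus, update_freq))
```

Under these assumptions, the original run has a ceiling of 128000 tokens per update, which is consistent with the logged wpb of about 102175 since batches are rarely packed to the limit. Both suggested settings give the same ceiling of 16000 tokens per update.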

Alexander Wei · Answer 5 · Wed Apr 24 2019 03:45:31 GMT+0800 (China Standard Time)

Hi @pipibjc, we really appreciate your help with this issue. We had previously been reporting BLEU on the validation set. When we ran our model on the test set, we obtained a BLEU of 7.17 (without the updated data preparation scripts).

I am a bit surprised that there is such a big discrepancy between the validation and test BLEU scores. Do you have an explanation for this difference? The paper seems to suggest that the sentences in these two datasets should come from the same distribution.

Peng-Jen Chen · Answer 6 · Thu Apr 25 2019 01:36:54 GMT+0800 (China Standard Time)

@aw31, the validation and test sets here have different distributions.

To clarify the terminology: the valid set here is the dev set in the paper, and the test set here is the devtest set in the paper. The test set in the paper has not been published yet.
The devtest and test sets have the same distribution. In both sets, half of the sentences originate from the source language and the other half from the target language. The dev set, however, does not keep this balanced ratio, so its distribution differs from that of the devtest and test sets.

More details can be found in section "3.4 Resulting data-sets" in the paper.

Peng-Jen Chen · Answer 7 · Fri May 03 2019 05:08:58 GMT+0800 (China Standard Time)

Closing, but feel free to re-open if there are further issues.