marian-nmt / marian

Fast Neural Machine Translation in C++

Home Page: https://marian-nmt.github.io

Irregular training: BLEU drops sharply for several batches before increasing again

EtienneAb3d opened this issue · comments

Hi!

What I'm doing is a bit particular: translating noisy text into denoised, clean text (for the NeuroSpell auto-corrector). In principle, this should not cause the strange behavior described below. It has worked perfectly with ModernMT for several years. I'm now experimenting with Marian.

I'm using:

  • lr-decay-inv-sqrt 5000
  • shuffle data
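
For reference, a minimal sketch of a training invocation with these options; the paths, vocabularies, and remaining parameters below are placeholders, not my actual setup:

```bash
# Hypothetical Marian training command illustrating the options above.
# Model path, corpora, vocabularies and validation settings are placeholders.
./marian \
    --model model/model.npz \
    --train-sets corpus.noisy corpus.clean \
    --vocabs vocab.spm vocab.spm \
    --lr-decay-inv-sqrt 5000 \
    --shuffle data \
    --seed 1111 \
    --valid-sets valid.noisy valid.clean \
    --valid-metrics cross-entropy bleu
```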

On one training run (not all of them), BLEU drops sharply over several batches, while the cost rises in a similar way; BLEU then recovers.

If the data are shuffled, I see no reason why many bad sentence pairs would be concentrated in a sequence of several batches. Also, the first epoch is fine; this occurs only in the second epoch (and presumably in the third as well).

This is reproducible using the same seed parameter (not yet tested with another value).

Do you see a reason why this could occur?

Training curves (given the nature of the task, a high BLEU is normal):
[image: training curves]

PS: updated graph

Remark: the previous post's graph has been updated for easier comparison with the one below.

I'm not absolutely certain, but I may have an explanation and a solution. It may also be of interest for standard bilingual translation tasks.

In this monolingual denoising task, a target sentence may appear many times in the training corpus:

  • some of the initial corpora may be redundant; to keep the learned statistics pertinent, it's important to preserve this redundancy.
  • some of the initial corpora are duplicated many times to obtain in-domain specialization.
  • each initial sentence occurs as the target of several artificially noised source copies.

By default, Marian sorts maxi-batches by target sentences, i.e. maxi-batch-sort trg. This can cause several mini-batches to be filled with a single, repeated target sentence, or with very few distinct sentences. I think this produces over-fitting on these mini-batches, destroying the quality of the whole model in a very short time.

To solve this, ideally, sentences should be shuffled without any sorting, i.e. with the option maxi-batch-sort none. But this makes training about 3x slower, which is quite prohibitive.

A half-solution is to use maxi-batch-sort src, as shown in the sketch below. Since identical target sentences have different noisy source sentences, this ensures a minimum of diversity in each mini-batch and prevents too many identical target sentences from being grouped together.
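
A sketch of the changed invocation (same placeholders as in the earlier command; only the batching options differ):

```bash
# The default --maxi-batch-sort trg groups sentences by target, which here
# can fill a mini-batch with copies of a single target sentence.
# Sorting by source instead spreads identical targets across mini-batches.
./marian \
    --model model/model.npz \
    --train-sets corpus.noisy corpus.clean \
    --vocabs vocab.spm vocab.spm \
    --lr-decay-inv-sqrt 5000 \
    --shuffle data \
    --seed 1111 \
    --maxi-batch 100 \
    --maxi-batch-sort src
```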

Doing this (with the same seed, and thus the same shuffled sentence order), I get slightly faster training, without the sharp BLEU drop:
[image: training curves with maxi-batch-sort src]

An unrelated question - what do you use to generate the graphs?