google-research / electra

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Loss of base and large models

stefan-it opened this issue


I'm currently working on a new non-English ELECTRA model. Training on GPU seems to work and is running fine 🤗

The next step is to try model training on a TPU, so I would like to ask whether you could post the final loss of both the base and large models (or even share the training loss curves), so that we have a reference point when training our own models 🤔

Many thanks in advance!


Hi Stefan,

Great to see non-English models being trained already! Here are some pre-training curves (x-axis is pre-train steps) for the models. Note that these metrics were computed with "do_eval": true, which means there is no dropout, so the losses you see during training will be slightly higher.

Small model on OpenWebText, batch_size=128, mask_percent=0.15

Base model on WikiBooks, batch_size=256, mask_percent=0.15

Large model on XLNet data, batch_size=2048, mask_percent=0.25

The losses will vary depending on your setting, because the discriminator loss depends on the quality of the generator, the generator loss can depend on the training data, etc.
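To make that dependence concrete, here is a minimal sketch of how the two losses fit together, assuming the objective described in the ELECTRA paper (generator masked-LM loss plus a weighted discriminator loss). The function names are made up for illustration, and the default weight of 50 is an assumption taken from the paper, not something stated in this thread.

```python
import math


def masked_lm_loss(true_token_probs):
    """Mean negative log-likelihood the generator assigns to the
    original tokens at the masked positions."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)


def discriminator_loss(replaced_probs, is_replaced):
    """Mean binary cross-entropy over all positions.

    replaced_probs[i] is the discriminator's probability that token i
    was replaced; is_replaced[i] is the true label.
    """
    total = 0.0
    for p, y in zip(replaced_probs, is_replaced):
        total += -math.log(p) if y else -math.log(1.0 - p)
    return total / len(is_replaced)


def total_loss(gen_loss, disc_loss, disc_weight=50.0):
    """Combined objective: generator loss plus weighted discriminator loss.
    disc_weight=50.0 is an assumed default, see the note above."""
    return gen_loss + disc_weight * disc_loss
```

Since the discriminator only sees what the generator samples, a change in generator quality shifts the discriminator loss even when the discriminator itself has not changed.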

Hi @clarkkev,

thanks for these interesting insights! Great to have these training curves 🤗

@clarkkev Do you have any idea about why there's a peak in disc_loss at the very beginning of the training curve?


@yaolu From our own experiments, the generator starts out creating random predictions, then plateaus for a bit using the median prediction. So every [MASK] gets replaced with the token "the" or the token ".". This makes the discriminator's job very easy for a bit. Eventually the generator moves past this plateau and starts generating more diverse predictions, which causes the discriminator loss to peak for a bit.

Here's a sample validation prediction at step 1000. The red-highlighted tokens are the generator's replacements, and the yellow-highlighted tokens are the tokens which the discriminator predicts as corrupted. Note that the discriminator simply predicts that every "the" and "." token is corrupt, and the generator in this case replaces every [MASK] with a ".".

[Screenshot: sample validation prediction at step 1000]
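The plateau described above can be reproduced with a toy example. This is only a sketch: the sentence, the masked positions, and the helper names are invented for illustration, and the "discriminator" is a hard-coded rule rather than a trained model.

```python
def corrupt(tokens, masked_positions, fill_token="."):
    """Mimic a plateaued generator: every masked position gets fill_token."""
    corrupted = list(tokens)
    for i in masked_positions:
        corrupted[i] = fill_token
    return corrupted


def replaced_labels(original, corrupted):
    """A token counts as replaced only if it differs from the original."""
    return [o != c for o, c in zip(original, corrupted)]


def flag_frequent(corrupted, suspicious=("the", ".")):
    """Trivial discriminator rule: flag every 'the' and '.' as corrupted."""
    return [tok in suspicious for tok in corrupted]


def precision_recall(predicted, actual):
    """Precision and recall of the 'replaced' class."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    precision = tp / sum(predicted) if any(predicted) else 0.0
    recall = tp / sum(actual) if any(actual) else 0.0
    return precision, recall


original = ["the", "cat", "sat", "on", "the", "mat", "."]
corrupted = corrupt(original, masked_positions=[1, 5])
labels = replaced_labels(original, corrupted)
flags = flag_frequent(corrupted)
precision, recall = precision_recall(flags, labels)
```

In this toy case the rule catches every replacement, at the cost of false alarms on genuine "the" and "." tokens. Once the generator produces more varied tokens, a rule like this stops working, which fits the temporary peak in the discriminator loss.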

Great to have these training curves! I'm currently working on a new non-English ELECTRA large model and have run into a problem during pre-training: the loss drops at first, then rises again after a while. Could you post the learning rate used for the base and large models (or even share the learning rate curve), so that we have a reference point for the learning rate when training our own models? My loss and accuracy curves are below. Looking forward to your suggestions.


Many thanks in advance!
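For comparing learning rate curves, a warmup-then-linear-decay schedule of the kind commonly used for BERT-style pre-training can be sketched as below. The peak rate, warmup length, and total step count are illustrative assumptions, not the values used for the released models, and the function name is made up.

```python
def learning_rate(step, peak_lr=2e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps.

    All default values are illustrative assumptions.
    """
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: fall linearly from peak_lr to 0, clamped at 0.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Sample the schedule at a few steps to compare against a logged curve.
samples = [(s, learning_rate(s)) for s in (0, 5_000, 10_000, 505_000, 1_000_000)]
```

Plotting values like these next to the logged learning rate is a quick way to check whether a rise in loss lines up with the end of warmup.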


How did you draw the graph?
When I set do_eval to true and evaluated the model, TensorBoard showed only a single point.
How can I get a curve with a value at every step?

I also cannot draw this graph. Did you manage to solve it?
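Assuming each evaluation run reports metrics for a single checkpoint (which would match the one point seen in TensorBoard), one way to get a curve is to evaluate several saved checkpoints and collect one point per checkpoint. The sketch below shows only the bookkeeping: evaluate_checkpoint is a hypothetical callable you would have to supply yourself, not a function from the ELECTRA codebase, and fake_evaluate exists only to make the example runnable.

```python
def collect_eval_curve(checkpoint_steps, evaluate_checkpoint):
    """Evaluate each checkpoint once; return (step, metrics) points sorted by step.

    evaluate_checkpoint is a hypothetical callable: step -> dict of metrics.
    """
    points = []
    for step in sorted(checkpoint_steps):
        points.append((step, evaluate_checkpoint(step)))
    return points


def series(points, metric_name):
    """Split the collected points into x (steps) and y (one metric) lists."""
    steps = [step for step, _ in points]
    values = [metrics[metric_name] for _, metrics in points]
    return steps, values


def fake_evaluate(step):
    """Stand-in for a real evaluation run; the numbers are made up."""
    return {"disc_loss": 1.0 / (1 + step / 1000)}


curve = collect_eval_curve([3000, 1000, 2000], fake_evaluate)
steps, disc_loss = series(curve, "disc_loss")
```

The resulting steps and disc_loss lists can be passed to any plotting tool; the curve's resolution is limited by how often checkpoints were saved.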