Loss of base and large models
stefan-it opened this issue · comments
Hi,
I'm currently working on a new non-English ELECTRA model. Training on GPU seems to work and is running fine 🤗
The next step would be to try model training on a TPU, so I would just like to ask if you could post the final loss of both the base and large models (or even share the training loss curve) so that we have a kind of reference point when training our own models 🤔
Many thanks in advance,
Stefan
Hi Stefan,
Great to see non-English models being trained already! Here are some pre-training curves (x-axis is pre-train steps) for the models. Note that these metrics were computed with "do_eval": true, which means there is no dropout, so training losses will be slightly higher.
Small model on OpenWebText, batch_size=128, mask_percent=0.15
Base model on WikiBooks, batch_size=256, mask_percent=0.15
Large model on XLNet data, batch_size=2048, mask_percent=0.25
The losses will vary depending on your setting because the discriminator loss depends on the quality of the generator, the generator loss can depend on the training data, etc.
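One way to see why the discriminator loss is tied to generator quality is to look at how many tokens actually end up corrupted. The sketch below is only a back-of-the-envelope illustration: it assumes that a generator sample equal to the original token is labelled as "original" (as in ELECTRA's replaced token detection), and `generator_accuracy` is a hypothetical stand-in for the probability that a sampled token matches the original. The function name is made up for this example.

```python
def corrupted_fraction(mask_percent: float, generator_accuracy: float) -> float:
    """Expected fraction of input tokens that differ from the original text.

    Assumption: a masked position only counts as "replaced" when the
    generator's sample differs from the original token.
    """
    if not 0.0 <= mask_percent <= 1.0:
        raise ValueError("mask_percent must be in [0, 1]")
    if not 0.0 <= generator_accuracy <= 1.0:
        raise ValueError("generator_accuracy must be in [0, 1]")
    return mask_percent * (1.0 - generator_accuracy)


if __name__ == "__main__":
    # Illustrative accuracies only; these are not measured values.
    for mask_percent in (0.15, 0.25):
        for acc in (0.0, 0.3, 0.6):
            frac = corrupted_fraction(mask_percent, acc)
            print(f"mask_percent={mask_percent} accuracy={acc} -> corrupted={frac:.3f}")
```

As the generator improves, fewer positions are truly replaced and the remaining replacements are more plausible, so the class balance and difficulty the discriminator sees both shift with the generator.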
Hi @clarkkev,
thanks for these interesting insights! Great to have these training curves 🤗
@clarkkev Do you have any idea about why there's a peak in disc_loss at the very beginning of the training curve?
@yaolu From our own experiments, the generator starts out creating random predictions, then plateaus for a bit using the median prediction. So every [MASK] gets replaced with the "the" token or the "." token. This makes the discriminator's job very easy for a bit. Eventually the generator moves past this plateau and starts generating more diverse predictions, which causes the discriminator loss to peak for a bit.
Here's a sample validation prediction at step 1000. The red-highlighted tokens are the generator's replacements, and the yellow-highlighted tokens are the tokens which the discriminator predicts as corrupted. Note that the discriminator simply predicts that every "the" and "." token is corrupt, and the generator in this case replaces every [MASK] with a ".".
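The plateau described above can be mimicked with a toy example. This is only a sketch under stated assumptions, not the ELECTRA code: the sentence, the masked positions, and the helper names are all hypothetical, the "generator" always emits one filler token, and the "discriminator" is a hard-coded rule that flags every occurrence of that token.

```python
def corrupt(tokens, masked_positions, filler):
    """Replace the masked positions with a single filler token.

    A position is labelled as replaced only if the filler differs
    from the original token at that position.
    """
    corrupted = list(tokens)
    labels = [False] * len(tokens)
    for i in masked_positions:
        corrupted[i] = filler
        labels[i] = filler != tokens[i]
    return corrupted, labels


def flag_tokens(corrupted, suspicious):
    """Naive discriminator: flag every token found in the suspicious set."""
    return [tok in suspicious for tok in corrupted]


def recall_and_precision(labels, preds):
    """Recall and precision of the flags against the true replaced labels."""
    true_pos = sum(l and p for l, p in zip(labels, preds))
    n_replaced = sum(labels)
    n_flagged = sum(preds)
    recall = true_pos / n_replaced if n_replaced else 0.0
    precision = true_pos / n_flagged if n_flagged else 0.0
    return recall, precision


if __name__ == "__main__":
    tokens = "the cat sat on the mat . the dog ran .".split()
    corrupted, labels = corrupt(tokens, masked_positions=[2, 5, 8], filler="the")
    preds = flag_tokens(corrupted, suspicious={"the"})
    recall, precision = recall_and_precision(labels, preds)
    print(" ".join(corrupted))
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this toy setup the rule catches every replacement but also flags the genuine "the" tokens, which matches the behaviour in the sample prediction. Once the generator produces more varied tokens, such a shortcut stops working and the discriminator has a harder task.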
Hi,
Great to have these training curves! I'm currently working on a new non-English ELECTRA large model, and I encountered some problems during pre-training: the loss first drops for a period of time and then rises. Could you post the learning rate of both the base and large models (or even share the learning rate curve) so that we have a kind of reference point for the learning rate when training our own models? The following are my loss and accuracy curves. Looking forward to your suggestions.
Many thanks in advance!
How did you draw the graph?
When I set do_eval to true and evaluated the model, TensorBoard showed only a single point.
How can I plot the graph at every step?
I also cannot draw this graph. Have you solved it?