bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Is a training log like this normal? I cannot find the loss in the logs, and what does "grad norm: nan" mean?

alphanlp opened this issue 9 months ago · comments

alphanlp commented 9 months ago

d norm: nan | actual seqlen: 2048 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.886 | TFLOPs: 78.46 |
iteration 5426/ 250000 | consumed samples: 43408 | consumed tokens: 88899584 | elapsed time per iteration (ms): 4247.9 | learning rate: 2.999E-04 | global batch size: 8 | loss scale: 1.0 | grad norm: nan | actual seqlen: 2048 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.883 | TFLOPs: 78.36 |
iteration 5427/ 250000 | consumed samples: 43416 | consumed tokens: 88915968 | elapsed time per iteration (ms): 4225.8 | learning rate: 2.999E-04 | global batch size: 8 | loss scale: 1.0 | grad norm: nan | actual seqlen: 2048 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.893 | TFLOPs: 78.77 |
iteration 5428/ 250000 | consumed samples: 43424 | consumed tokens: 88932352 | elapsed time per iteration (ms): 4229.2 | learning rate: 2.999E-04 | global batch size: 8 | loss scale: 1.0 | grad norm: nan | actual seqlen: 2048 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.892 | TFLOPs: 78.71 |
iteration 5429/ 250000 | consumed samples: 43432 | consumed tokens: 88948736 | elapsed time per iteration (ms): 4233.6 | learning rate: 2.999E-04 | global batch size: 8 | loss scale: 1.0 | grad norm: nan | actual seqlen: 2048 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.890 | TFLOPs: 78.63 |
iteration 5430/ 250000 | consumed samples: 43440 | consumed tokens: 88965120 | elapsed time per iteration (ms): 4247.0 | learning rate: 2.999E-04 | global batch size: 8 | loss scale: 1.0 | grad norm: nan | actual seqlen: 2048 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.884 | TFLOPs: 78.38 |
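
One way to check a log like the one above is to split each line into its fields and test them mechanically. The following is a minimal sketch, not part of Megatron-DeepSpeed: the helper names `parse_log_line` and `diagnose` are hypothetical, and the assumption that a loss value would appear as a field whose name contains "loss" (other than "loss scale") is only a guess at the log format. It reports what the pasted lines show (no loss field, and a NaN grad norm even though "number of nan iterations" is 0), and confirms that the token and throughput counters are internally consistent.

```python
import math
import re

# Sample line copied verbatim from the log above.
LINE = (
    "iteration 5426/ 250000 | consumed samples: 43408 | consumed tokens: 88899584 | "
    "elapsed time per iteration (ms): 4247.9 | learning rate: 2.999E-04 | "
    "global batch size: 8 | loss scale: 1.0 | grad norm: nan | actual seqlen: 2048 | "
    "number of skipped iterations: 0 | number of nan iterations: 0 | "
    "samples per second: 1.883 | TFLOPs: 78.36 |"
)


def parse_log_line(line):
    """Split a '|'-separated log line into a dict of field name -> raw string."""
    fields = {}
    for part in line.split("|"):
        part = part.strip()
        if not part:
            continue
        # The first field has the form "iteration 5426/ 250000" (no colon).
        m = re.match(r"iteration\s+(\d+)/\s*(\d+)$", part)
        if m:
            fields["iteration"] = m.group(1)
            fields["train_iters"] = m.group(2)
            continue
        if ":" in part:
            key, value = part.rsplit(":", 1)
            fields[key.strip()] = value.strip()
    return fields


def diagnose(fields):
    """Return a list of human-readable observations about one parsed line."""
    problems = []
    # Assumption: a loss value would be logged under a name containing "loss".
    loss_keys = [k for k in fields if "loss" in k and k != "loss scale"]
    if not loss_keys:
        problems.append("no loss field is present in this line")
    grad_norm = float(fields.get("grad norm", "nan"))
    if math.isnan(grad_norm):
        nan_iters = fields.get("number of nan iterations", "?")
        problems.append(
            f"grad norm is NaN while 'number of nan iterations' is {nan_iters}"
        )
    return problems


fields = parse_log_line(LINE)
for problem in diagnose(fields):
    print(f"iteration {fields['iteration']}: {problem}")

# Consistency checks on the counters that are present in the line.
tokens = int(fields["consumed samples"]) * int(fields["actual seqlen"])
print("tokens match:", tokens == int(fields["consumed tokens"]))
seconds = float(fields["elapsed time per iteration (ms)"]) / 1000.0
print("samples per second (recomputed):", int(fields["global batch size"]) / seconds)
```

Running the same two functions over every line of a log file would show the iteration at which the grad norm first became NaN, which narrows down where to look.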