NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch


How to handle gradient overflow when training a deep model with mixed precision?

tfwu opened this issue · comments

Hi, the model trains fine with fp32, but we would like to take advantage of the speed of mixed precision. Unfortunately, we get gradient overflow and then NaN loss at the very beginning with both opt-levels O1 and O2. Is there a good way to handle gradient overflow / underflow using apex? Thank you

Hi @tfwu,
do you have a reproducible code snippet so that we can have a look?
The loss should not get NaN values, so this is an issue we would like to investigate.
If you are working with private data, we would just need the shapes and could possibly just use randomly initialized data.

Best,
ptrblck

Occasionally seeing a message like “overflow detected, skipping step, reducing loss scale” is normal behavior with dynamic loss scaling, and it usually happens in the first few iterations because Amp begins by trying a high loss scale.
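For intuition, here is a rough sketch in plain PyTorch (not apex's actual implementation) of what dynamic loss scaling does: scale the loss before backward, skip the optimizer step and halve the scale whenever the gradients contain Inf/NaN, and periodically try a larger scale again after a run of successful steps. The initial scale, growth interval, and helper names below are illustrative assumptions, not Amp's real values.

```python
import torch

loss_scale = 2.0 ** 16      # start high (illustrative initial value)
GROWTH_INTERVAL = 2000      # try a larger scale after this many good steps (assumed value)
good_steps = 0

def training_step(model, optimizer, loss_fn, inputs, targets):
    """One step of hand-rolled dynamic loss scaling (sketch only)."""
    global loss_scale, good_steps

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    (loss * loss_scale).backward()                     # scaled backward pass

    grads = [p.grad for p in model.parameters() if p.grad is not None]
    if any(not torch.isfinite(g).all() for g in grads):
        # This is the "Gradient overflow. Skipping step..." case:
        # drop the update and halve the scale.
        loss_scale /= 2
        good_steps = 0
        return loss.item()

    for g in grads:                                    # unscale before stepping
        g.div_(loss_scale)
    optimizer.step()

    good_steps += 1
    if good_steps >= GROWTH_INTERVAL:                  # periodically probe a higher scale
        loss_scale *= 2
        good_steps = 0
    return loss.item()
```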

Seeing NaN loss values (i.e., the loss scalar resulting from the forward pass is NaN or Inf) is NOT normal, and indicates something has gone wrong. As Piotr says, if this is the case, a minimal repro would be helpful.
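If it helps, a minimal repro could look something like the following, with the real model swapped in and random data of the right shapes; everything here (layer sizes, batch size, learning rate) is a placeholder.

```python
import torch
import torch.nn as nn
from apex import amp

# Placeholder model and shapes -- substitute the ones that produce the NaN loss.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for step in range(100):
    inputs = torch.randn(32, 128, device="cuda")           # random stand-in data
    targets = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    with amp.scale_loss(loss, optimizer) as scaled_loss:    # Amp handles the loss scaling
        scaled_loss.backward()
    optimizer.step()

    print(step, loss.item())
```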

Is it normal to see many (~30) "gradient overflow, skipping step, reducing loss scale" messages on the first epoch? My dataset is pretty large (22 classes, 9M images total) and it didn't make it past the first epoch after many hours, i.e. overnight. Granted, I'm only testing on a single RTX 2070, but I thought it might have at least gotten to epoch 2 by then. Is that unrealistic? On a tiny dataset, I get 2-3 gradient overflows on the 1st epoch, and then none (usually) on the subsequent epochs. I'm using OPT=O1.

@gbrow004 Do these messages appear consecutively, or are they spread out? Amp occasionally tries to increase the loss scale, so for a really long epoch or run there will be proportionately more messages. Aside from a few (2-3) at the beginning of training, these should be isolated rather than consecutive; if you are seeing 30 in a row, that is likely a problem. Also, what is the range of loss scale values Amp says it is using?
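In case it is useful for answering that, one way to record the loss scale over time (instead of grepping the printed messages) is to dump Amp's state periodically; `amp.state_dict()` is the same call used for checkpointing and includes the current loss scaler state. The training-loop variables below (`loader`, `model`, `criterion`, `optimizer`) are assumed to be set up already via `amp.initialize`.

```python
from apex import amp

LOG_EVERY = 100  # hypothetical logging interval

for step, (inputs, targets) in enumerate(loader):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()

    if step % LOG_EVERY == 0:
        # One entry per loss scaler, including its current dynamic loss scale.
        print(f"step {step}: amp state {amp.state_dict()}")
```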

To be honest, I'm not sure if they are consecutive or not, but I don't think so. I'll let it run longer and see what happens. Thanks for the information. So far, here's the output:

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 4096.0

Since the loss scale is being reduced to 4096 and 2048 several times, it had to be increased in-between, which means the gradient overflows did not occur consecutively.
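That check can also be done mechanically on a captured log: pull out the scale values and count how often the scale went back up between two overflow messages, since any increase implies successful (non-skipped) steps happened in between. `log_text` below is a hypothetical string holding the captured training output.

```python
import re

# log_text is assumed to hold the captured stdout of the training run.
pattern = re.compile(r"reducing loss scale to ([0-9.e+-]+)")
scales = [float(m.group(1)) for m in pattern.finditer(log_text)]

increases = sum(1 for prev, cur in zip(scales, scales[1:]) if cur > prev)
print(f"{len(scales)} overflow messages; the scale went back up {increases} times")
# Any increase means non-overflowing steps occurred between those two messages,
# i.e. the overflows were not consecutive.
```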

Makes sense. Thanks, ptrblck! I guess I'll send it to the main GPU cluster and see what happens. It was going painfully slow on my single GPU testbed!

@mcarilli: You said that gradient overflow messages can happen in the first few epochs. What happens to my model if gradient overflow occurs in every epoch (maybe 2 or 5 times per epoch), even though I reduced the learning rate? How should I fix it?

I am looking for an answer to the same question @John1231983 asked above. Someone please help.

I was trying this model in Google Colab: https://www.kaggle.com/taindow/pytorch-resnext-101-32x8d-benchmark

I reduced the size of the data so Colab doesn't crash and also reduced the batch size down to 32, but after almost 2 hours of training I get this error:
ZeroDivisionError: float division by zero

I understand why this error is happening, but I can't figure out how to solve it. Is it because the batch size is 32 and not 64? (A sketch of possible workarounds follows the log below.)

Here is my model log:

Epoch 0/0
Iteration
5% 1091/20994 [43:46<13:17:55, 2.41s/it]

Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 64.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.5
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25
[... the same message repeats, with the loss scale halving on every iteration ...]
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.140210802639048e-171
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.070105401319524e-171
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.035052700659762e-171
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.17526350329881e-172
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.587631751649405e-172
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2938158758247024e-172
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.469079379123512e-173
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.234539689561756e-173
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.617269844780878e-173
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.08634922390439e-174
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.043174611952195e-174
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0215873059760975e-174
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0107936529880487e-174
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.053968264940244e-175
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.526984132470122e-175
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.263492066235061e-175
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.317460331175305e-176
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1587301655876523e-176
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5793650827938261e-176
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.896825413969131e-177
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9484127069845653e-177
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9742063534922827e-177
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.871031767461413e-178
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.935515883730707e-178
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4677579418653533e-178
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2338789709326767e-178
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.169394854663383e-179
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.084697427331692e-179
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.542348713665846e-179
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.71174356832923e-180
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.855871784164615e-180
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9279358920823073e-180
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.639679460411536e-181
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.819839730205768e-181
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.409919865102884e-181
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.204959932551442e-181
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.02479966275721e-182
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.012399831378605e-182
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5061999156893026e-182
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.530999578446513e-183
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.7654997892232564e-183
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8827498946116282e-183
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.413749473058141e-184
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.706874736529071e-184
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3534373682645353e-184
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1767186841322676e-184
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.883593420661338e-185
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.941796710330669e-185
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4708983551653345e-185
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.354491775826673e-186
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6772458879133364e-186
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8386229439566682e-186
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.193114719783341e-187
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5965573598916705e-187
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2982786799458352e-187
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1491393399729176e-187
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.745696699864588e-188
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.872848349932294e-188
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.436424174966147e-188
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.182120874830735e-189
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5910604374153675e-189
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7955302187076838e-189
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.977651093538419e-190
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4888255467692094e-190
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2444127733846047e-190
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1222063866923024e-190
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.611031933461512e-191
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.805515966730756e-191
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.402757983365378e-191
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.01378991682689e-192
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.506894958413445e-192
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7534474792067224e-192
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.767237396033612e-193
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.383618698016806e-193
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.191809349008403e-193
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0959046745042015e-193
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.479523372521008e-194
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.739761686260504e-194
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.369880843130252e-194
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.84940421565126e-195
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.42470210782563e-195
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.712351053912815e-195
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.561755269564074e-196
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.280877634782037e-196
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1404388173910186e-196
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0702194086955093e-196
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.351097043477547e-197
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6755485217387732e-197
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3377742608693866e-197
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.688871304346933e-198
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3444356521734666e-198
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6722178260867333e-198
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.361089130433666e-199
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.180544565216833e-199
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0902722826084166e-199
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0451361413042083e-199
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.225680706521042e-200
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.612840353260521e-200
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3064201766302604e-200
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.532100883151302e-201
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.266050441575651e-201
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6330252207878255e-201
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.165126103939127e-202
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.082563051969564e-202
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.041281525984782e-202
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.020640762992391e-202
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.103203814961955e-203
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5516019074809773e-203
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2758009537404886e-203
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.379004768702443e-204
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1895023843512216e-204
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5947511921756108e-204
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.973755960878054e-205
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.986877980439027e-205
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9934389902195135e-205
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.967194951097568e-206
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.983597475548784e-206
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.491798737774392e-206
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.245899368887196e-206
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.22949684443598e-207
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.11474842221799e-207
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.557374211108995e-207
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.786871055544975e-208
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8934355277724873e-208
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9467177638862437e-208
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.733588819431218e-209
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.866794409715609e-209
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4333972048578046e-209
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2166986024289023e-209
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.083493012144512e-210
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.041746506072256e-210
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.520873253036128e-210
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.60436626518064e-211
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.80218313259032e-211
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.90109156629516e-211
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5054578314758e-212
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.7527289157379e-212
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.37636445786895e-212
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.188182228934475e-212
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.940911144672375e-213
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9704555723361872e-213
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4852277861680936e-213
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.426138930840468e-214
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.713069465420234e-214
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.856534732710117e-214
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.282673663550585e-215
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.641336831775293e-215
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3206684158876463e-215
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1603342079438231e-215
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.801671039719116e-216
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.900835519859558e-216
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.450417759929779e-216
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.252088799648895e-217
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6260443998244473e-217
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8130221999122236e-217
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.065110999561118e-218
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.532555499780559e-218
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2662777498902796e-218
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1331388749451398e-218
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.665694374725699e-219
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8328471873628494e-219
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4164235936814247e-219
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.082117968407124e-220
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.541058984203562e-220
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.770529492101781e-220
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.852647460508905e-221
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4263237302544523e-221
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2131618651272261e-221
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1065809325636131e-221
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.5329046628180653e-222
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7664523314090327e-222
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3832261657045163e-222
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.916130828522582e-223
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.458065414261291e-223
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7290327071306454e-223
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.645163535653227e-224
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.322581767826614e-224
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.161290883913307e-224
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0806454419566534e-224
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.403227209783267e-225
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7016136048916335e-225
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3508068024458167e-225
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.754034012229084e-226
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.377017006114542e-226
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.688508503057271e-226
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.442542515286355e-227
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.2212712576431773e-227
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1106356288215886e-227
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0553178144107943e-227
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.276589072053972e-228
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.638294536026986e-228
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.319147268013493e-228
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.595736340067465e-229
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2978681700337323e-229
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6489340850168661e-229
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.244670425084331e-230
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.1223352125421653e-230
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0611676062710827e-230
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0305838031355413e-230
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.152919015677707e-231
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5764595078388533e-231
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2882297539194267e-231
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.441148769597133e-232
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.220574384798567e-232
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6102871923992833e-232
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.051435961996417e-233
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0257179809982083e-233
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0128589904991042e-233
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0064294952495521e-233
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.0321474762477604e-234
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5160737381238802e-234
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2580368690619401e-234
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.290184345309701e-235
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1450921726548502e-235
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5725460863274251e-235
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.862730431637126e-236
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.931365215818563e-236
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9656826079092814e-236
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.828413039546407e-237
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.914206519773204e-237
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.457103259886602e-237
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.228551629943301e-237
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.142758149716505e-238
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0713790748582522e-238
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5356895374291261e-238
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.678447687145631e-239
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8392238435728152e-239
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9196119217864076e-239
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.598059608932038e-240
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.799029804466019e-240
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3995149022330095e-240
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1997574511165048e-240
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.998787255582524e-241
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.999393627791262e-241
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.499696813895631e-241
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.498484069478155e-242
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.7492420347390774e-242
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8746210173695387e-242
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.373105086847693e-243
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.686552543423847e-243
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3432762717119234e-243
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1716381358559617e-243
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.858190679279809e-244
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9290953396399042e-244
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4645476698199521e-244
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.322738349099761e-245
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.6613691745498803e-245
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8306845872749401e-245
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.153422936374701e-246
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5767114681873503e-246
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2883557340936752e-246
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1441778670468376e-246
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.720889335234188e-247
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.860444667617094e-247
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.430222333808547e-247
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.151111669042735e-248
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5755558345213674e-248
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7877779172606837e-248
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.938889586303419e-249
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.4694447931517093e-249
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2347223965758547e-249
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1173611982879273e-249
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.586805991439637e-250
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7934029957198183e-250
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3967014978599092e-250
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.983507489299546e-251
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.491753744649773e-251
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7458768723248864e-251
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.729384361624432e-252
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.364692180812216e-252
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.182346090406108e-252
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.091173045203054e-252
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.45586522601527e-253
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.727932613007635e-253
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3639663065038175e-253
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.819831532519088e-254
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.409915766259544e-254
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.704957883129772e-254
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.52478941564886e-255
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.26239470782443e-255
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.131197353912215e-255
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0655986769561075e-255
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.327993384780537e-256
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6639966923902686e-256
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3319983461951343e-256
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.659991730975672e-257
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.329995865487836e-257
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.664997932743918e-257
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.32498966371959e-258
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.162494831859795e-258
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0812474159298974e-258
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0406237079649487e-258
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.2031185398247434e-259
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.6015592699123717e-259
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3007796349561859e-259
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.503898174780929e-260
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2519490873904646e-260
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6259745436952323e-260
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.129872718476162e-261
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.064936359238081e-261
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0324681796190404e-261
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0162340898095202e-261
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.081170449047601e-262
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5405852245238005e-262
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2702926122619002e-262
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.351463061309501e-263
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.1757315306547506e-263
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5878657653273753e-263
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.939328826636877e-264
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.9696644133184383e-264
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9848322066592191e-264
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.924161033296096e-265
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.962080516648048e-265
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.481040258324024e-265
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.240520129162012e-265
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.20260064581006e-266
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.10130032290503e-266
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.550650161452515e-266
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.753250807262575e-267
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8766254036312874e-267
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9383127018156437e-267
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.691563509078218e-268
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.845781754539109e-268
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.4228908772695546e-268
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2114454386347773e-268
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.057227193173887e-269
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0286135965869433e-269
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5143067982934716e-269
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.571533991467358e-270
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.785766995733679e-270
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8928834978668395e-270
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.464417489334198e-271
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.732208744667099e-271
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3661043723335494e-271
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1830521861667747e-271
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.915260930833874e-272
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.957630465416937e-272
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4788152327084684e-272
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.394076163542342e-273
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.697038081771171e-273
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8485190408855855e-273
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.242595204427927e-274
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.621297602213964e-274
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.310648801106982e-274
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.155324400553491e-274
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.776622002767455e-275
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8883110013837273e-275
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4441555006918637e-275
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.220777503459318e-276
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.610388751729659e-276
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8051943758648296e-276
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.025971879324148e-277
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.512985939662074e-277
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.256492969831037e-277
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1282464849155185e-277
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.641232424577593e-278
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.8206162122887962e-278
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4103081061443981e-278
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.051540530721991e-279
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5257702653609953e-279
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7628851326804976e-279
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.814425663402488e-280
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.407212831701244e-280
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.203606415850622e-280
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.101803207925311e-280
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.509016039626555e-281
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7545080198132776e-281
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3772540099066388e-281
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.886270049533194e-282
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.443135024766597e-282
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7215675123832985e-282
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.607837561916492e-283
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.303918780958246e-283
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.151959390479123e-283
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0759796952395615e-283
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.379898476197808e-284
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.689949238098904e-284
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.344974619049452e-284
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.72487309524726e-285
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.36243654762363e-285
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.681218273811815e-285
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.406091369059075e-286
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.2030456845295373e-286
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1015228422647686e-286
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0507614211323843e-286
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.253807105661922e-287
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.626903552830961e-287
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3134517764154804e-287
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.567258882077402e-288
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.283629441038701e-288
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6418147205193505e-288
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.209073602596753e-289
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.1045368012983762e-289
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0522684006491881e-289
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0261342003245941e-289
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.1306710016229703e-290
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.5653355008114852e-290
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.2826677504057426e-290
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.413338752028713e-291
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2066693760143564e-291
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6033346880071782e-291
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.016673440035891e-292
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.008336720017946e-292
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.004168360008973e-292
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0020841800044864e-292
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.010420900022432e-293
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.505210450011216e-293
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.252605225005608e-293
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.26302612502804e-294
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.13151306251402e-294
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.56575653125701e-294
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.82878265628505e-295
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.914391328142525e-295
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9571956640712625e-295
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.785978320356312e-296
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.892989160178156e-296
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.446494580089078e-296
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.223247290044539e-296
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.116236450222695e-297
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0581182251113476e-297
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.5290591125556738e-297
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.645295562778369e-298
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.8226477813891845e-298
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9113238906945923e-298
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.556619453472961e-299
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.778309726736481e-299
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3891548633682403e-299
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1945774316841202e-299
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.972887158420601e-300
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9864435792103004e-300
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4932217896051502e-300
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.466108948025751e-301
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.7330544740128755e-301
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8665272370064378e-301
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.332636185032189e-302
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.6663180925160944e-302
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3331590462580472e-302
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1665795231290236e-302
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.832897615645118e-303
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.916448807822559e-303
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4582244039112795e-303
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.291122019556398e-304
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.645561009778199e-304
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8227805048890994e-304
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.113902524445497e-305
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.5569512622227484e-305
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2784756311113742e-305
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1392378155556871e-305
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.696189077778436e-306
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.848094538889218e-306
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.424047269444609e-306
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.120236347223045e-307
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.5601181736115222e-307
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7800590868057611e-307
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.900295434028806e-308
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.450147717014403e-308
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2250738585072014e-308
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1125369292536007e-308
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.562684646268003e-309
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.781342323134e-309
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.390671161567e-309
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.953355807835e-310
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.4766779039175e-310
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.73833895195875e-310
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.691694759794e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.345847379897e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.1729236899484e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.086461844974e-311
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.43230922487e-312
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.716154612436e-312
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.35807730622e-312
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.7903865311e-313
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.39519326554e-313
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.69759663277e-313
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.487983164e-314
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.243991582e-314
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.121995791e-314
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0609978955e-314
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.304989477e-315
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.65249474e-315
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.32624737e-315
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.63123685e-316
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3156184e-316
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6578092e-316
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.289046e-317
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.144523e-317
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0722615e-317
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.036131e-317
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.180654e-318
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.590327e-318
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.295163e-318
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.4758e-319
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.2379e-319
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.61895e-319
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.095e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0474e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0237e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.012e-320
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.06e-321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.53e-321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.265e-321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.3e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.16e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.6e-322
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1e-323
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5e-324
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0

ZeroDivisionError Traceback (most recent call last)
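Once the dynamic loss scale has been halved all the way down to 0.0, the next unscale divides by zero, which is where the ZeroDivisionError above comes from. A scale that keeps shrinking like this usually means every single step produces Inf/NaN gradients, so the real fix is to find the source of the non-finite values; still, the scaler can be given a floor so it never reaches zero. A minimal sketch, assuming a toy linear model (the min_loss_scale and max_loss_scale keyword arguments of apex.amp.initialize are real, the rest is illustrative):

```python
# A minimal sketch of keeping the dynamic loss scale from collapsing to 0.0.
# min_loss_scale / max_loss_scale put a floor and a ceiling on the scaler.
# Note this only avoids the ZeroDivisionError; a scale that keeps halving
# usually means the gradients themselves are NaN/Inf.
import torch
from apex import amp

model = torch.nn.Linear(10, 10).cuda()                      # toy stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

model, optimizer = amp.initialize(
    model, optimizer,
    opt_level="O1",
    min_loss_scale=1.0,      # the scaler will never go below 1.0, so it cannot hit 0.0
    max_loss_scale=2.0**16,  # optional ceiling; apex's default is 2**24
)
```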

Hi, I encountered similar problems during my training process.
Something like the following happens:
Grad overflow on iteration 123084. Using dynamic loss scale of 65536.0
Grad overflow on iteration 124221. Using dynamic loss scale of 65536.0
Grad overflow on iteration 125432. Using dynamic loss scale of 65536.0
Grad overflow on iteration 126241. Using dynamic loss scale of 65536.0
Grad overflow on iteration 127589. Using dynamic loss scale of 65536.0
Grad overflow on iteration 128457. Using dynamic loss scale of 65536.0
There are about 140,000 iterations per epoch, and this seems to happen roughly every 1,000 iterations.
I was wondering whether it hurts performance. Is this normal or not?
Thanks very much!

Hi, I have encountered the same problem.

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0

Is there any solution to this problem?

#318 (comment)
#318 (comment)
Getting this message occasionally (every 1000 iterations or so) is normal. It happens when Amp periodically attempts to increase the loss scale: sometimes the new scale is too high and the gradients overflow, in which case Amp skips that step and reduces the scale again.

The message may occur several times in succession at the beginning of training as the scale value calibrates.
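For reference, this is roughly where the skipped step happens in a standard Amp (O1) training loop; a minimal sketch with a toy model and random data:

```python
# The dynamic loss scaler inspects the gradients produced inside amp.scale_loss;
# when it finds an inf/nan it prints the "Gradient overflow. Skipping step..."
# message, skips the optimizer step for that iteration, and halves the scale.
import torch
from apex import amp

model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
criterion = torch.nn.CrossEntropyLoss()

for step in range(100):
    data = torch.randn(32, 10, device="cuda")
    target = torch.randint(0, 2, (32,), device="cuda")

    optimizer.zero_grad()
    loss = criterion(model(data), target)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()   # gradients are computed on the scaled loss
    optimizer.step()             # skipped by Amp for this iteration if an overflow was detected
```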

@mobassir94 Has your problem been solved? I am stuck on the same problem.
I set the max loss scale to 128.0.

My log is as follows: @mcarilli

grad_max = 0.001679 grad_avg = 0.000126
clipped_grad_max = 0.001679 clipped_grad_avg = 0.000126
iter 0: loss = 2.995887 dice = 0.001561
grad_max = 0.001803 grad_avg = 0.000143
clipped_grad_max = 0.001803 clipped_grad_avg = 0.000143
iter 1: loss = 2.995314 dice = 0.001781
grad_max = 0.003364 grad_avg = 0.000241
clipped_grad_max = 0.003364 clipped_grad_avg = 0.000241
iter 2: loss = 2.993279 dice = 0.002536
grad_max = 0.005255 grad_avg = 0.000370
clipped_grad_max = 0.005255 clipped_grad_avg = 0.000370
iter 3: loss = 2.993289 dice = 0.002550
grad_max = 0.003805 grad_avg = 0.000267
clipped_grad_max = 0.003805 clipped_grad_avg = 0.000267
iter 4: loss = 2.994864 dice = 0.001978
grad_max = 0.005750 grad_avg = 0.000432
clipped_grad_max = 0.005750 clipped_grad_avg = 0.000432
iter 5: loss = 2.992242 dice = 0.002897
grad_max = 0.005317 grad_avg = 0.000401
clipped_grad_max = 0.005317 clipped_grad_avg = 0.000401
iter 6: loss = 2.994675 dice = 0.001973
grad_max = 0.006786 grad_avg = 0.000489
clipped_grad_max = 0.006786 clipped_grad_avg = 0.000489
iter 7: loss = 2.993610 dice = 0.002465
grad_max = 0.007121 grad_avg = 0.000515
clipped_grad_max = 0.007121 clipped_grad_avg = 0.000515
iter 8: loss = 2.994181 dice = 0.002169
grad_max = 0.007923 grad_avg = 0.000573
clipped_grad_max = 0.007923 clipped_grad_avg = 0.000573
iter 9: loss = 2.994954 dice = 0.001854
grad_max = 0.005527 grad_avg = 0.000415
clipped_grad_max = 0.005527 clipped_grad_avg = 0.000415
iter 10: loss = 2.996653 dice = 0.001214
grad_max = 0.006917 grad_avg = 0.000537
clipped_grad_max = 0.006917 clipped_grad_avg = 0.000537
iter 11: loss = 2.996503 dice = 0.001257
grad_max = 0.006592 grad_avg = 0.000576
clipped_grad_max = 0.006592 clipped_grad_avg = 0.000576
iter 12: loss = 2.995537 dice = 0.001560
grad_max = 0.005741 grad_avg = 0.000619
clipped_grad_max = 0.005741 clipped_grad_avg = 0.000619
iter 13: loss = 2.994136 dice = 0.001929
grad_max = 0.003185 grad_avg = 0.000449
clipped_grad_max = 0.003185 clipped_grad_avg = 0.000449
iter 14: loss = 2.995934 dice = 0.001396
grad_max = 0.001697 grad_avg = 0.000354
clipped_grad_max = 0.001697 clipped_grad_avg = 0.000354
iter 15: loss = 2.996895 dice = 0.001017
grad_max = 0.002463 grad_avg = 0.000245
clipped_grad_max = 0.002463 clipped_grad_avg = 0.000245
iter 16: loss = 2.995060 dice = 0.001652
grad_max = 0.003388 grad_avg = 0.000195
clipped_grad_max = 0.003388 clipped_grad_avg = 0.000195
iter 17: loss = 2.996236 dice = 0.001355
grad_max = 0.003118 grad_avg = 0.000121
clipped_grad_max = 0.003118 clipped_grad_avg = 0.000121
iter 18: loss = 2.997606 dice = 0.000888
grad_max = 0.003628 grad_avg = 0.000104
clipped_grad_max = 0.003628 clipped_grad_avg = 0.000104
iter 19: loss = 2.997371 dice = 0.000941
grad_max = 0.006864 grad_avg = 0.000145
clipped_grad_max = 0.006864 clipped_grad_avg = 0.000145
iter 20: loss = 2.995636 dice = 0.001620
grad_max = 0.005955 grad_avg = 0.000143
clipped_grad_max = 0.005955 clipped_grad_avg = 0.000143
iter 21: loss = 2.998232 dice = 0.000696
grad_max = 0.005847 grad_avg = 0.000206
clipped_grad_max = 0.005847 clipped_grad_avg = 0.000206
iter 22: loss = 2.997889 dice = 0.000813
grad_max = 0.005149 grad_avg = 0.000119
clipped_grad_max = 0.005149 clipped_grad_avg = 0.000119
iter 23: loss = 2.997258 dice = 0.000982
grad_max = 0.000666 grad_avg = 0.000013
clipped_grad_max = 0.000666 clipped_grad_avg = 0.000013
iter 24: loss = 2.997273 dice = 0.000989
grad_max = 0.001956 grad_avg = 0.000037
clipped_grad_max = 0.001956 clipped_grad_avg = 0.000037
iter 25: loss = 2.997074 dice = 0.001038
grad_max = 0.004974 grad_avg = 0.000094
clipped_grad_max = 0.004974 clipped_grad_avg = 0.000094
iter 26: loss = 2.996999 dice = 0.001086
grad_max = 0.011436 grad_avg = 0.000219
clipped_grad_max = 0.011436 clipped_grad_avg = 0.000219
iter 27: loss = 2.996643 dice = 0.001152
grad_max = 0.009122 grad_avg = 0.000173
clipped_grad_max = 0.009122 clipped_grad_avg = 0.000173
iter 28: loss = 2.997333 dice = 0.000961
grad_max = 0.007901 grad_avg = 0.000156
clipped_grad_max = 0.007901 clipped_grad_avg = 0.000156
iter 29: loss = 2.997518 dice = 0.000829
grad_max = 0.006899 grad_avg = 0.000135
clipped_grad_max = 0.006899 clipped_grad_avg = 0.000135
iter 30: loss = 2.996865 dice = 0.001099
grad_max = 0.000702 grad_avg = 0.000013
clipped_grad_max = 0.000702 clipped_grad_avg = 0.000013
iter 31: loss = 2.997784 dice = 0.000751
grad_max = 0.000005 grad_avg = 0.000000
clipped_grad_max = 0.000005 clipped_grad_avg = 0.000000
iter 32: loss = 2.996029 dice = 0.001361
grad_max = 0.000000 grad_avg = 0.000000
clipped_grad_max = 0.000000 clipped_grad_avg = 0.000000
iter 33: loss = 2.996109 dice = 0.001328
grad_max = 0.000002 grad_avg = 0.000000
clipped_grad_max = 0.000002 clipped_grad_avg = 0.000000
iter 34: loss = 2.997235 dice = 0.000935
grad_max = 0.000000 grad_avg = 0.000000
clipped_grad_max = 0.000000 clipped_grad_avg = 0.000000
iter 35: loss = 2.997193 dice = 0.000968
grad_max = 0.000000 grad_avg = 0.000000
clipped_grad_max = 0.000000 clipped_grad_avg = 0.000000
iter 36: loss = 2.997750 dice = 0.000755
grad_max = 0.000000 grad_avg = 0.000000
clipped_grad_max = 0.000000 clipped_grad_avg = 0.000000
iter 37: loss = 2.996224 dice = 0.001274
grad_max = 0.000000 grad_avg = 0.000000
clipped_grad_max = 0.000000 clipped_grad_avg = 0.000000
iter 38: loss = 2.998214 dice = 0.000612
grad_max = 0.000000 grad_avg = 0.000000
clipped_grad_max = 0.000000 clipped_grad_avg = 0.000000
iter 39: loss = 2.997241 dice = 0.000937
grad_max = 0.000000 grad_avg = 0.000000
clipped_grad_max = 0.000000 clipped_grad_avg = 0.000000
iter 40: loss = 2.995601 dice = 0.001470
grad_max = 0.000004 grad_avg = 0.000000
clipped_grad_max = 0.000004 clipped_grad_avg = 0.000000
iter 41: loss = 2.996757 dice = 0.001082
grad_max = 0.000000 grad_avg = 0.000000
clipped_grad_max = 0.000000 clipped_grad_avg = 0.000000
iter 42: loss = 2.996395 dice = 0.001212
grad_max = 0.000000 grad_avg = 0.000000
clipped_grad_max = 0.000000 clipped_grad_avg = 0.000000
iter 43: loss = 2.996039 dice = 0.001326
grad_max = 0.000000 grad_avg = 0.000000
clipped_grad_max = 0.000000 clipped_grad_avg = 0.000000
iter 44: loss = 2.996497 dice = 0.001177
grad_max = 0.000000 grad_avg = 0.000000
clipped_grad_max = 0.000000 clipped_grad_avg = 0.000000
iter 45: loss = 2.997445 dice = 0.000853
grad_max = 0.000000 grad_avg = 0.000000
clipped_grad_max = 0.000000 clipped_grad_avg = 0.000000
iter 46: loss = 2.997257 dice = 0.000921
grad_max = 0.000000 grad_avg = 0.000000
clipped_grad_max = 0.000000 clipped_grad_avg = 0.000000
iter 47: loss = 2.997901 dice = 0.000702
grad_max = 0.000000 grad_avg = 0.000000
clipped_grad_max = 0.000000 clipped_grad_avg = 0.000000
iter 48: loss = 2.997082 dice = 0.000975
grad_max = 0.000000 grad_avg = 0.000000
clipped_grad_max = 0.000000 clipped_grad_avg = 0.000000
iter 49: loss = 2.996974 dice = 0.001009
grad is nan: torch.Size([4, 1, 3, 3, 3])
tensor([[[[[nan, nan, nan],
           [nan, nan, nan],
           [nan, nan, nan]],

It seems like you have NaN or null values in your input, and that's why you are getting that error @chengmengli06

@mobassir94 The forward pass overflows. If I reduce the learning rate from 1e-3 to 1e-4, it can run for 20 epochs, then the forward pass overflows again.
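A minimal sketch of one way to localize where a forward pass first produces non-finite values, using standard PyTorch forward hooks (the helper name below is illustrative, not an apex API):

```python
import torch

def register_nonfinite_checks(model):
    """Attach a forward hook to every submodule and report modules
    whose tensor output contains NaN/Inf values."""
    def make_hook(name):
        def hook(module, inputs, output):
            if torch.is_tensor(output) and not torch.isfinite(output).all():
                print(f"Non-finite activation in module: {name}")
        return hook
    # Keep the handles so the hooks can be removed (handle.remove()) after debugging.
    return [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules()]
```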

I am facing a similar issue while training a GAN architecture with a pre-trained generator. The logs look like this:
Gradient overflow. Skipping step, loss scaler 2 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 2 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 2 reducing loss scale to 8192.0
.
.
.
The loss scale goes to zero and results in an error saying NaN or Inf found in input tensor. Is there a fix for this? The earlier conversation seems open-ended.
I am using opt_level O1.
Thanks in advance.

me too:
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0 repeated 5 times

@vishal16babu
This behavior is typically observed when the output of your model or the loss gets a sudden NaN value.
This might happen if, e.g., your training is not stable or your input data contains invalid values.

Does your model train in FP32 without apex?

@devsentient
Are you also observing a division by zero or just these 5 steps in downgrading the loss scaler (which is normal)?

hey @ptrblck - I just ran into this issue while doing some investigation. I'm uncertain whether apex O1 mode can be used on a FP32 trained model without apex. Please see #750

Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
This repeats until the scale reaches 0 and I get an error.
Has anyone solved this?

Hey, I got the same ZeroDivisionError, as follows:

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 262144.0                                      
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 262144.0                                      
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0                                        
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8192.0                                        
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 256.0                                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 256.0                                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.0                                           
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 8.0                                           
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.25                                          
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.25  
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0078125                                     
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.0078125                                     
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.000244140625                                
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.000244140625                                
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06                             
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06                             
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10                        
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10                        
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.2737367544323206e-13                        
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.2737367544323206e-13                        
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.105427357601002e-15                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 7.105427357601002e-15                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.168404344971009e-19                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.168404344971009e-19                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.776263578034403e-21                         
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.776263578034403e-21 

Then I got NaN:

step 2900/210000 (669 example/last step); acc:   0.00; ppl:   nan; xent:  nan; lr: 0.00000191;   0/26651 tok/s;   5468 sec
-- many iterations ----
[Logger(3)] [2020-03-20 03:55:42,898 WARNING] NaN or Inf found in input tensor. 

Then, after many NaN iterations, it finally raises:

  File "/opt/conda/envs/py36/lib/python3.6/site-packages/apex/amp/_process_optimizer.py", line 135, in post_backward_m
odels_are_masters                                                                                                     
    scale_override=(grads_have_scale, stashed_have_scale, out_scale))                                                 
  File "/opt/conda/envs/py36/lib/python3.6/site-packages/apex/amp/scaler.py", line 176, in unscale_with_stashed       
    out_scale/grads_have_scale,   # 1./scale,                                                                         
ZeroDivisionError: float division by zero    

Our usage:

  1. multi-card training with NCCL (as usual)
  2. gradient accumulation, roughly like this:
for batch in accumulated_batch:
    # forward
    # loss = criterion(xxx)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()

if self.grad_accum_count > 1:
    if self.n_gpu > 1:
        grads = [p.grad.data for p in self.model.parameters()
                 if p.requires_grad and p.grad is not None]
        # NOTE, P1: sync across the cards
        distributed.all_reduce_and_rescale_tensors(grads, float(1))
# NOTE, P2: do the step
optimizer.step()
optimizer.zero_grad()

I'm worried that the previous P1 and P2 can cause problems:

  1. P1: if each card doesn't overflow on its own, can the gradients still overflow after the reduce-sum? (A tiny fp16 demo of this is sketched below.)
  2. P2: if one batch in the accumulated batch overflows but the other batches are fine, can the final optimizer.step still go wrong? (I thought amp tracks in backward whether an overflow happened and, if so, the step does nothing; but if we call backward multiple times, could the step that should really run get corrupted?)
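As a tiny illustration of the P1 worry (not from the thread, and only relevant when the tensors being all-reduced are actually fp16, which depends on the opt_level and which grads are reduced): two values that each fit in fp16 can still overflow when summed, because fp16 tops out around 65504.

```python
import torch

a = torch.tensor([40000.0], dtype=torch.float16)
b = torch.tensor([40000.0], dtype=torch.float16)
print(a + b)  # tensor([inf], dtype=torch.float16): 80000 exceeds fp16's max (~65504)
```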

Currently, the fp32 training is OK (about 9 epochs); amp O1 failed at epoch 1.

@mcarilli could you please take some time to look into this? I'm waiting online, thanks.


Hey, has anyone looked into this problem?

I have tried:

  1. Looking at the amp implementation coarsely: it seems that on overflow amp just zeroes the grads, so the step can't cause NaN. So the previous P2 is impossible.

  2. Running this code on a single GPU: it was OK after one night (though still in epoch 1). It had some overflows but kept training fine, with no NaN/Inf.

  3. From fairseq I saw that it sets min_loss_scale, and then I noticed max_loss_scale, which I had never paid attention to. I set max_loss_scale=2**16 (the default initial value, which doesn't overflow) and min_loss_scale=1. Under this setting I ran the 8-card training for one night (now in epoch 4); it works fine, with seemingly no overflow at all.
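For reference, a minimal sketch of what point 3 looks like when passed to apex (model and optimizer are placeholders; max_loss_scale and min_loss_scale are keyword arguments accepted by amp.initialize):

```python
from apex import amp

# model and optimizer are assumed to be built already (placeholders here)
model, optimizer = amp.initialize(
    model, optimizer,
    opt_level="O1",
    max_loss_scale=2 ** 16,  # cap the dynamic loss scale at its usual starting value
    min_loss_scale=1.0,      # never let the scaler shrink below 1
)
```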

Overall, I think the previous P1 is dangerous and is probably what causes the problem. What's more, I was also concerned that the NLL loss in fp16 mode would overflow to Inf/NaN, but in 2 and 3 that hasn't happened.

Any suggestions or conclusions?

——————

Oh, I finally figured it out.

Let me summarize:

The amp has_overflow state is not synced across all processes in multi-card training.
So when one card overflows, that process skips the step correctly, BUT the other processes that did not overflow will not skip it!
Because we have already synced the grads, the grads are invalid as soon as any process overflows, so in the end everything gets messed up.

**What does amp do when it hits an overflow?** It simply hacks the optimizer's step function: it swaps in a new function that prints the gradient overflow message and then restores the original step.

**How to deal with this situation?** It is easy: just sync the amp overflow state across all processes.

My code is based on OpenNMT-py. Syncing the state is easy, but we need to find out whether amp has overflowed and hack the step the same way amp does. I just copied and modified the amp source code, as follows:

        def _step():
            """step function wrapper. totally safe for fp32"""
            def _multiprocess_sync_amp_is_overflow():
                """need sync optimizer is-overflow state when multiprocessing"""
                if self.args.model_dtype != "fp16":
                    return False
                # get current process amp state
                local_overflow_cnt = 0
                for o in self.optims:
                    if o.optimizer._amp_stash.already_patched:
                        local_overflow_cnt = 1
                        break
                # print(f"Device {self.gpu_rank} local overflow state: {local_overflow_cnt}")
                # Sync the global overflow
                global_overflow_cnt = local_overflow_cnt
                if self.n_gpu > 1:
                    global_overflow_cnt = sum(distributed.all_gather_list(global_overflow_cnt))
                # print(f"Gloal overflow state: {global_overflow_cnt}")
                is_global_overflow = global_overflow_cnt > 0
                return is_global_overflow

            def patch_step(opt):
                """this function is copied from apex"""
                opt_step = opt.step
                def skip_step(closure=None):
                    if closure is not None:
                        raise RuntimeError("Currently, Amp does not support closure use with optimizers.")
                    logger.info(f"Device[{self.gpu_rank}] Gradient overflow. Skipping step. "
                            "(This is from hack-for-optimizer-sync)")
                    if hasattr(opt._amp_stash, "all_fp32_from_fp16_params"):
                        # Clear the master grads that wouldn't be zeroed by model.zero_grad()
                        for param in opt._amp_stash.all_fp32_from_fp16_params:
                            param.grad = None
                    if hasattr(opt, "most_recent_scale"):
                        opt.most_recent_scale = 1.0
                        opt.scale_set_by_backward = False
                    opt.step = opt_step
                    opt._amp_stash.already_patched = False
                return skip_step

            if self.n_gpu > 1:
                is_global_overflow = _multiprocess_sync_amp_is_overflow()
                if is_global_overflow:
                    # hack the optimizer
                    for o in self.optims:
                        if o.optimizer._amp_stash.already_patched:
                            continue
                        o.optimizer.step = patch_step(o.optimizer)
                        o.optimizer._amp_stash.already_patched = True

The previous code solved my problem.
Good luck.

Does your model train normally without fp16?
I found that my problem was some NaNs in my dataset, which led to NaN gradients / gradient overflow. After correcting my data, it trained without error.
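For anyone who wants to run the same kind of check, a minimal sketch (the dataloader name and batch layout are assumptions) that scans a dataset for non-finite values before training:

```python
import torch

def find_nonfinite_batches(dataloader):
    """Return the indices of batches whose inputs or targets contain NaN/Inf."""
    bad = []
    for i, (inputs, targets) in enumerate(dataloader):
        if not torch.isfinite(inputs).all() or not torch.isfinite(targets.float()).all():
            bad.append(i)
    return bad
```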

Oh, that's great! Congratulations!
I've been running fp32 for about 3 days and it is still stable (about 9 epochs), but fp16 failed at epoch 1.

@ptrblck Is there a way to tell amp not to touch the loss scaling at all? I have the same problem, with gradient overflow appearing almost continuously (there are a few steps where it doesn't appear, but the vast majority, around 99%, are gradient overflows).

I have already tried setting loss_scale=1 in amp.initialize, but that results in NaNs from the get-go.
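For reference, amp.initialize takes a loss_scale argument that can be either "dynamic" (the default for O1/O2 and the behaviour discussed in this thread) or a fixed float. A minimal sketch (model and optimizer are placeholders, and 128.0 is just an example value):

```python
from apex import amp

# Fixed loss scale: amp keeps this value instead of adapting it, so it has to be
# chosen so the scaled gradients neither overflow fp16 nor underflow to zero.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1", loss_scale=128.0)

# The usual behaviour (dynamic scaling with step skipping) would be:
# model, optimizer = amp.initialize(model, optimizer, opt_level="O1", loss_scale="dynamic")
```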

I'm also running into this issue; can anyone suggest a good solution?

Hi!
Facing a similar issue here while training on multi-gpu with 'O1':

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10
....
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 6.72487309524726e-285
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.0261342003245941e-289
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 1.56575653125701e-294
Epoch: 0, Step: 1000 / 230430, loss = nan

With opt_level 'O0' it runs fine, which makes me think it is not a data problem.
I also tried running on a single GPU with 'O1' to rule out a synchronization problem, but the behaviour is the same for one and multiple GPUs. It is also the same with opt_level='O2': the step-skipping pattern is identical, but it fails with

cpu_sum = float(model_grad.float().sum())
RuntimeError: CUDA error: an illegal memory access was encountered 

before returning a nan loss

So where did you put that piece of code? Which file in the apex folder?

apex/amp/handle.py

I have encountered the same problem in fairseq.
The reason: I want to train a model with a large number of parameters. The big model leads to OOM, so I have to reduce the batch size, and the small batch size leads to NaN or Inf in the params.

A small amount of gradient overflow is acceptable, but eventually the loss scale shrinks to the threshold value, which interrupts training.

My solution: set update_freq to 4, which effectively enlarges the batch size via gradient accumulation; this is still waiting for verification. Reducing the learning rate is another option, but it didn't work for me.
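For anyone reproducing the update_freq idea outside fairseq: it is just gradient accumulation. A minimal, hedged sketch with apex amp is below (dataloader, model, criterion, optimizer, and iters_to_accumulate are placeholders; delay_unscale=True is an amp.scale_loss option that, as I understand it, postpones unscaling on iterations where no optimizer step is taken):

```python
from apex import amp

iters_to_accumulate = 4  # analogous to fairseq's update_freq=4

for i, (inputs, targets) in enumerate(dataloader):
    outputs = model(inputs)
    # Divide so the accumulated gradient matches a single large-batch step.
    loss = criterion(outputs, targets) / iters_to_accumulate

    if (i + 1) % iters_to_accumulate == 0:
        # Stepping iteration: let amp unscale and run its overflow check.
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    else:
        # Accumulation-only iteration: backward, but delay the unscaling.
        with amp.scale_loss(loss, optimizer, delay_unscale=True) as scaled_loss:
            scaled_loss.backward()
```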