MeteoSwiss / ldcast

Latent diffusion for generative precipitation nowcasting

NaN or Inf found in input tensor.

bugsuse opened this issue · comments

Hi,
I was running train_autoenc.py with the default hyperparameters when I encountered the error below, after which training stopped.
Would you mind helping with this?

Epoch 11: : 1200it [21:02,  1.05s/it, loss=0.0638, v_num=0, val_loss=0.054, val_rec_loss=0.0454, val_kl_loss=0.819]
Metric val_rec_loss improved by 0.010 >= min_delta = 0.0. New best score: 0.045
Epoch 12: : 600it [11:53,  1.19s/it, loss=nan, v_num=0, val_loss=0.054, val_rec_loss=0.0454, val_kl_loss=0.819]NaN or Inf found in input tensor.
...
Epoch 12: : 1200it [20:38,  1.03s/it, loss=nan, v_num=0, val_loss=0.054, val_rec_loss=0.0454, val_kl_loss=0.819]NaN or Inf found in input tensor.
Epoch 12: : 1200it [20:40,  1.03s/it, loss=nan, v_num=0, val_loss=nan.0, val_rec_loss=nan.0, val_kl_loss=nan.0]
Monitored metric val_rec_loss = nan is not finite. Previous best value was 0.045. Signaling Trainer to stop.
Epoch 12: : 1200it [20:40,  1.03s/it, loss=nan, v_num=0, val_loss=nan.0, val_rec_loss=nan.0, val_kl_loss=nan.0]

This seems a bit strange. I'm trying to see on my system if I can reproduce it. Meanwhile, are you able to restart from a checkpoint and see if the problem occurs again?

Meanwhile, in the above commit I added an easy option to continue training the autoencoder from a checkpoint (this option already existed for the diffusion model training).

Thanks for the suggestions! @jleinonen

I restarted from a checkpoint using the command below, and the problem occurred again:

time python train_autoenc.py --ckpt_path lightning_logs/version_0/checkpoints/epoch\=11-val_rec_loss\=0.0454.ckpt

To debug this, I added a print statement to the data loading code:

(raw, _) = split.train_valid_test_split(raw, var, chunks=chunks)

# Debug: print the RZC scale array before building the transform.
print("RZC scale: ", raw["train"][var]["scale"])
variables[var]["transform"] = transform.default_rainrate_transform(
    raw["train"][var]["scale"]
)

The output is as follows:

Loading data...
RZC scale: [0.00000000e+00 3.52649689e-02 7.17734098e-02 1.09569430e-01
 1.48698330e-01 1.89207077e-01 2.31144428e-01 2.74560571e-01
 3.19507957e-01 3.66040230e-01 4.14213538e-01 4.64085698e-01
 5.15716553e-01 5.69168210e-01 6.24504805e-01 6.81792855e-01
 7.41101146e-01 8.02500963e-01 8.66065979e-01 9.31872606e-01
 1.00000000e+00 1.07052994e+00 1.14354682e+00 1.21913886e+00
 1.29739666e+00 1.37841415e+00 1.46228886e+00 1.54912114e+00
 1.63901591e+00 1.73208046e+00 1.82842708e+00 1.92817140e+00
 2.03143311e+00 2.13833642e+00 2.24900961e+00 2.36358571e+00
 2.48220229e+00 2.60500193e+00 2.73213196e+00 2.86374521e+00
 3.00000000e+00 3.14105988e+00 3.28709364e+00 3.43827772e+00
 3.59479332e+00 3.75682831e+00 3.92457771e+00 4.09824228e+00
 4.27803183e+00 4.46416092e+00 4.65685415e+00 4.85634279e+00
 5.06286621e+00 5.27667284e+00 5.49801922e+00 5.72717142e+00
 5.96440458e+00 6.21000385e+00 6.46426392e+00 6.72749043e+00
 7.00000000e+00 7.28211975e+00 7.57418728e+00 7.87655544e+00
 8.18958664e+00 8.51365662e+00 8.84915543e+00 9.19648457e+00
 9.55606365e+00 9.92832184e+00 1.03137083e+01 1.07126856e+01
 1.11257324e+01 1.15533457e+01 1.19960384e+01 1.24543428e+01
 1.29288092e+01 1.34200077e+01 1.39285278e+01 1.44549809e+01
 1.50000000e+01 1.55642395e+01 1.61483746e+01 1.67531109e+01
 1.73791733e+01 1.80273132e+01 1.86983109e+01 1.93929691e+01
 2.01121273e+01 2.08566437e+01 2.16274166e+01 2.24253712e+01
 2.32514648e+01 2.41066914e+01 2.49920769e+01 2.59086857e+01
 2.68576183e+01 2.78400154e+01 2.88570557e+01 2.99099617e+01
 3.10000000e+01 3.21284790e+01 3.32967491e+01 3.45062218e+01
 3.57583466e+01 3.70546265e+01 3.83966217e+01 3.97859383e+01
 4.12242546e+01 4.27132874e+01 4.42548332e+01 4.58507423e+01
 4.75029297e+01 4.92133827e+01 5.09841537e+01 5.28173714e+01
 5.47152367e+01 5.66800308e+01 5.87141113e+01 6.08199234e+01
 6.30000000e+01 6.52569580e+01 6.75934982e+01 7.00124435e+01
 7.25166931e+01 7.51092529e+01 7.77932434e+01 8.05718765e+01
 8.34485092e+01 8.64265747e+01 8.95096664e+01 9.27014847e+01
 9.60058594e+01 9.94267654e+01 1.02968307e+02 1.06634743e+02
 1.10430473e+02 1.14360062e+02 1.18428223e+02 1.22639847e+02
 1.27000000e+02 1.31513916e+02 1.36186996e+02 1.41024887e+02
 1.46033386e+02 1.51218506e+02 1.56586487e+02 1.62143753e+02
 1.67897018e+02 1.73853149e+02 1.80019333e+02 1.86402969e+02
 1.93011719e+02 1.99853531e+02 2.06936615e+02 2.14269485e+02
 2.21860947e+02 2.29720123e+02 2.37856445e+02 2.46279694e+02
 2.55000000e+02 2.64027832e+02 2.73373993e+02 2.83049774e+02
 2.93066772e+02 3.03437012e+02 3.14172974e+02 3.25287506e+02
 3.36794037e+02 3.48706299e+02 3.61038666e+02 3.73805939e+02
 3.87023438e+02 4.00707062e+02 4.14873230e+02 4.29538971e+02
 4.44721893e+02 4.60440247e+02 4.76712891e+02 4.93559387e+02
 5.11000000e+02 5.29055664e+02 5.47747986e+02 5.67099548e+02
 5.87133545e+02 6.07874023e+02 6.29345947e+02 6.51575012e+02
 6.74588074e+02 6.98412598e+02 7.23077332e+02 7.48611877e+02
 7.75046875e+02 8.02414124e+02 8.30746460e+02 8.60077942e+02
 8.90443787e+02 9.21880493e+02 9.54425781e+02 9.88118774e+02
 1.02300000e+03 1.05911133e+03 1.09649597e+03 1.13519910e+03
 1.17526709e+03 1.21674805e+03 1.25969189e+03 1.30415002e+03
 1.35017615e+03 1.39782520e+03 1.44715466e+03 1.49822375e+03
 1.55109375e+03 1.60582825e+03 1.66249292e+03 1.72115588e+03
 1.78188757e+03 1.84476099e+03 1.90985156e+03 1.97723755e+03
 2.04700000e+03 2.11922266e+03 2.19399194e+03 2.27139819e+03
 2.35153418e+03 2.43449609e+03 2.52038379e+03 2.60930005e+03
 2.70135229e+03 2.79665039e+03 2.89530933e+03 2.99744751e+03
 3.10318750e+03 3.21265649e+03 3.32598584e+03 3.44331177e+03
 3.56477515e+03 3.69052197e+03 3.82070312e+03 3.95547510e+03
 4.09500000e+03 4.23944531e+03 4.38898389e+03 4.54379639e+03
 4.70406836e+03 4.86999219e+03 5.04176758e+03 5.21960010e+03
 5.40370459e+03 5.59430078e+03 5.79161865e+03            nan
            nan            nan            nan            nan]
/public/home/ldcast/features/transform.py:80: RuntimeWarning: divide by zero encountered in log10
  log_scale = np.log10(scale).astype(np.float32)
Loading cached sampler from ../cache/sampler_autoenc_valid.pkl.
Loading cached sampler from ../cache/sampler_autoenc_test.pkl.
Loading cached sampler from ../cache/sampler_autoenc_train.pkl.

I found that the RZC scale array contains NaN values. Could this be the cause?

The rain rates are stored as 8-bit unsigned int values that are then translated to physical values in mm/h using the scale array. It is true that the last elements of scale are left at nan but this is because these values should never occur in the 8-bit data. I have never seen a problem that the actual inputs to the training would contain nan, so I'm a bit puzzled by this. Could you verify by drawing samples from the datamodule and checking with e.g. np.isfinite(x).all()?
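For reference, a minimal sketch of such a check, assuming the datamodule used by train_autoenc.py exposes a standard PyTorch dataloader yielding tensors or tuples of tensors (the helper name and the datamodule variable are hypothetical):

import numpy as np
import torch

def check_batches_finite(dataloader, max_batches=100):
    # Draw up to max_batches batches and report any non-finite values.
    for i, batch in enumerate(dataloader):
        if i >= max_batches:
            break
        tensors = batch if isinstance(batch, (tuple, list)) else (batch,)
        for j, t in enumerate(tensors):
            x = t.detach().cpu().numpy() if isinstance(t, torch.Tensor) else np.asarray(t)
            if not np.isfinite(x).all():
                print(f"Non-finite values in batch {i}, element {j}")
                return False
    print("All checked batches are finite.")
    return True

# Usage (hypothetical): check_batches_finite(datamodule.train_dataloader())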

Meanwhile I re-ran the autoencoder training and I saw that around epoch 10 the training loss spikes. In the worst cases this can cause the loss to go to nan, while in other cases it recovers quickly. And it seems that after this happens once, it does not occur again. It's as if the network somehow reorganizes itself. I recall now that I found the same thing happening back in October-November when I was first training the autoencoder.

Thanks for the explanation and suggestion. I will try to check it.

I restarted from the checkpoint again using the command below, and now it trains fine. It's really strange.

time python train_autoenc.py --ckpt_path lightning_logs/version_0/checkpoints/epoch\=11-val_rec_loss\=0.0454.ckpt

In addition, I guess this may be because log10(scale) is not finite when the scale is 0, so I replaced log_scale = np.log10(scale).astype(np.float32) in ldcast/features/transform.py with log_scale = np.log10(scale + 1).astype(np.float32). I ran train_autoenc.py again and it also works fine.
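For reference, a small standalone NumPy demonstration (not the ldcast code itself) of the behaviour involved: log10 of a zero entry gives -inf and triggers the divide-by-zero RuntimeWarning seen above, while adding 1 before the log avoids the warning for the zero entry; NaN entries in scale stay NaN either way.

import numpy as np

# Toy scale array with the two kinds of problematic entries seen above: 0 and NaN.
scale = np.array([0.0, 0.0352649689, 1.0, 5791.61865, np.nan])

log_scale = np.log10(scale).astype(np.float32)             # RuntimeWarning: divide by zero; -inf at 0, nan at nan
offset_log_scale = np.log10(scale + 1).astype(np.float32)  # no divide-by-zero warning; 0.0 at 0, still nan at nan

print(log_scale)
print(offset_log_scale)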

Note that a couple of lines below

log_scale = np.log10(scale).astype(np.float32)

we have

log_scale[~np.isfinite(log_scale)] = np.log10(fill_value)

which should ensure that all values in log_scale are non-NaN. (The default is fill_value=0 but in default_rainrate_transform it is set to 0.02, so it has a finite logarithm).
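A minimal sketch of that logic (the function name is hypothetical and this is not a verbatim copy of transform.py), using the fill_value=0.02 mentioned above for default_rainrate_transform:

import numpy as np

def build_log_scale(scale, fill_value=0.02):
    # Take log10 of the scale array, then replace any non-finite entries
    # (from 0 or NaN in scale) with log10(fill_value) so no NaN/inf remains.
    with np.errstate(divide="ignore", invalid="ignore"):
        log_scale = np.log10(scale).astype(np.float32)
    log_scale[~np.isfinite(log_scale)] = np.log10(fill_value)
    return log_scale

scale = np.array([0.0, 1.0, 63.0, np.nan])
print(build_log_scale(scale))  # zero and NaN entries both map to log10(0.02), roughly -1.7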

Thanks for the explanation! @jleinonen