neuralhydrology / neuralhydrology

Python library to train neural networks with a strong focus on hydrological applications.

Home Page: https://neuralhydrology.readthedocs.io/

RuntimeError: Loss was NaN for 1 times in a row. Stopped training.

Flash-Of-Thunder opened this issue · comments

Hello NeuralHydrology team! I was attempting to fit an LSTM to ~800 basins from the GAGES-II dataset and got the following error after 1 epoch. The process to prepare the data is similar to successful runs using CAMELS basins, but these basins were selected based on land-use characteristics (high agriculture). I have a feeling the problem is on my side, but I was wondering whether this error suggests any specific issues with my inputs. Also, I am running a NeuralHydrology version from several months ago (perhaps I should update it). Happy to provide more specific details.

Full Error Message

# Epoch 2:  71%|███████   | 10815/15299 [06:41<02:46, 26.91it/s, Loss: 1.4837]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/scratch/ials-gpu005/4320970/ipykernel_2363917/1118789532.py in <module>
     23         update_handler()
     24         os.chdir("/work/zhilyaev_umass_edu/data_dir")
---> 25         start_run(config_file=Path("/work/zhilyaev_umass_edu/data_dir/land_use_901_k12.yml"))
     26 
     27         # Repeat for no land use

/work/zhilyaev_umass_edu/neuralhydrology-master/neuralhydrology/nh_run.py in start_run(config_file, gpu)
     74         config.device = "cpu"
     75 
---> 76     start_training(config)
     77 
     78 

/work/zhilyaev_umass_edu/neuralhydrology-master/neuralhydrology/training/train.py in start_training(cfg)
     21         raise ValueError(f"Unknown head {cfg.head}.")
     22     trainer.initialize_training()
---> 23     trainer.train_and_validate()

/work/zhilyaev_umass_edu/neuralhydrology-master/neuralhydrology/training/basetrainer.py in train_and_validate(self)
    203                     param_group["lr"] = self.cfg.learning_rate[epoch]
    204 
--> 205             self._train_epoch(epoch=epoch)
    206             avg_loss = self.experiment_logger.summarise()
    207             LOGGER.info(f"Epoch {epoch} average loss: {avg_loss}")

/work/zhilyaev_umass_edu/neuralhydrology-master/neuralhydrology/training/basetrainer.py in _train_epoch(self, epoch)
    293                 nan_count += 1
    294                 if nan_count > self._allow_subsequent_nan_losses:
--> 295                     raise RuntimeError(f"Loss was NaN for {nan_count} times in a row. Stopped training.")
    296                 LOGGER.warning(f"Loss is Nan; ignoring step. (#{nan_count}/{self._allow_subsequent_nan_losses})")
    297             else:

RuntimeError: Loss was NaN for 1 times in a row. Stopped training.
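
One quick way to rule out problems with the inputs themselves is to scan the per-basin files for NaN or inf values before training. The sketch below is illustrative only: it assumes the generic-dataset layout (one NetCDF file per basin under <data_dir>/time_series/), and the path and variable subset are taken from the config further down, so they may need adjusting.

# Illustrative sketch: flag basins whose dynamic inputs contain NaN/inf values,
# one common cause of a NaN loss. Assumes the generic-dataset layout
# (<data_dir>/time_series/<basin>.nc); adjust paths and variable names as needed.
from pathlib import Path

import numpy as np
import xarray as xr

data_dir = Path("/work/zhilyaev_umass_edu/data_dir")      # data_dir from the config below
dynamic_inputs = ["prcp", "tmax", "tmin", "dayl", "Corn"]  # subset of the config's dynamic_inputs

for nc_file in sorted((data_dir / "time_series").glob("*.nc")):
    ds = xr.open_dataset(nc_file)
    for var in dynamic_inputs:
        if var not in ds:
            print(f"{nc_file.stem}: variable '{var}' missing")
        elif not np.isfinite(ds[var].values).all():
            print(f"{nc_file.stem}: '{var}' contains NaN/inf values")
    ds.close()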

Hi @Flash-Of-Thunder,
Could you please share the .yml config file you used for training? From your description we can't tell what's wrong. One guess would be that the learning rate might just be too high, but again, at this point I'm just guessing.

Here is the .yml file that produced the error. I reran it without the agricultural land-use dynamic inputs and it has been running fine so far (5 epochs).

# --- Experiment configurations --------------------------------------------------------------------

# experiment name, used as folder name
experiment_name: land_use_901_k12_2

# files to specify training, validation and test basins (relative to code root or absolute path)
train_basin_file: 901_basins_train.txt
validation_basin_file: 901_basins_train.txt
test_basin_file: 901_basins_test.txt

# training, validation and test time periods (format = 'dd/mm/yyyy')
train_start_date: "01/01/2008"
train_end_date: "31/12/2021"
validation_start_date: "01/01/2021"
validation_end_date: "31/12/2021"
test_start_date: "01/01/2008"
test_end_date: "30/12/2021"

# which GPU (id) to use [in format of cuda:0, cuda:1 etc, or cpu or None]
device: cuda:0

# --- Validation configuration ---------------------------------------------------------------------

# specify after how many epochs to perform validation
validate_every: 10

# specify how many random basins to use for validation
validate_n_random_basins: 530

# specify which metrics to calculate during validation (see neuralhydrology.evaluation.metrics)
# this can either be a list or a dictionary. If a dictionary is used, the inner keys must match the name of the
# target_variable specified below. Using dicts allows for different metrics per target variable.
metrics:
  - NSE

# --- Model configuration --------------------------------------------------------------------------

# base model type [lstm, ealstm, cudalstm, embcudalstm, mtslstm]
# (has to match the if statement in modelzoo/__init__.py)
model: cudalstm

# prediction head [regression]. Define the head specific parameters below
head: regression

# ----> Regression settings <----
output_activation: linear

# ----> General settings <----

# Number of cell states of the LSTM
hidden_size: 150

# Initial bias value of the forget gate
initial_forget_bias: 3

# Dropout applied to the output of the LSTM
output_dropout: 0.2

# --- Training configuration -----------------------------------------------------------------------

# specify optimizer [Adam]
optimizer: Adam

# specify loss [MSE, NSE, RMSE]
loss: NSE

# specify learning rates to use starting at specific epochs (0 is the initial learning rate)
learning_rate:
  0: 1e-2
  2: 5e-3
  3: 1e-3
  4: 1e-4  
  5: 1e-5

# Mini-batch size
batch_size: 256

# Number of training epochs
epochs: 5

# If a value, clips the gradients during training to that norm.
clip_gradient_norm: 1

# Defines which time steps are used to calculate the loss. Can't be larger than seq_length.
# If use_frequencies is used, this needs to be a dict mapping each frequency to a predict_last_n-value, else an int.
predict_last_n: 1

# Length of the input sequence
# If use_frequencies is used, this needs to be a dict mapping each frequency to a seq_length, else an int.
seq_length: 365

# Number of parallel workers used in the data pipeline
num_workers: 12

# Log the training loss every n steps
log_interval: 1

# If true, writes logging results into tensorboard file
log_tensorboard: True

# If a value and greater than 0, logs n random basins as figures during validation
log_n_figures: 1

# Save model weights every n epochs
save_weights_every: 1

# --- Data configurations --------------------------------------------------------------------------

# which data set to use [camels_us, camels_gb, global, hourly_camels_us]
dataset: generic

# Path to data set root
data_dir: /work/zhilyaev_umass_edu/data_dir


dynamic_inputs:
  - prcp
  - tmax
  - tmin
  - dayl
  - Corn
  - Soybeans
  - Other Hay_Non Alfalfa
  - Winter_Wheat
  - Spring_Wheat
  - Fallow
  - Alfalfa
  - Cotton
  - Sorghum
  - Other crops
  - Other tree crops

  
static_attributes:
  - DRAIN_SQKM
  - BAS_COMPACTNESS
  - ELEV_MEAN_M_BASIN
  - SLOPE_PCT
  - LAT_CENT
  - LONG_CENT
  - ELEV_MAX_M_BASIN
  - ELEV_MIN_M_BASIN
  - ROCKDEPAVE
  - STOR_NID_2009
  - DDENS_2009
  - MAJ_DDENS_2009
  
#  - T_MAX_BASIN
#  - PPTAVG_BASIN
#  - PPTAVG_SITE
#  - WD_BASIN
#  - WD_SITE
#  - SNOW_PCT_PRECIP
#  - ASPECT_DEGREES
#  - HGA
#  - HGB
#  - HGC
#  - HGD

# which columns to use as target
target_variables:
  - QObs

# clip negative predictions to zero for all variables listed below. Should be a list, even for single variables.
clip_targets_to_zero:
  - QObs

1e-2 is a relatively large initial learning rate for Adam, especially given that you seem to have quite a lot of samples. I'd suggest you try running the experiment starting with 1e-3 (the Adam default) or even lower and see if that helps.

You might also want to take a look at the loss curves in tensorboard. If they are very spiky, that's another indication that the learning rate is too high.
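
For reference, one possible adjustment to the schedule in the config above is shown below. Only the initial value (1e-3 at epoch 0) reflects the concrete suggestion; the later steps are simply scaled down by the same factor as an illustration.

# specify learning rates to use starting at specific epochs (0 is the initial learning rate)
learning_rate:
  0: 1e-3
  2: 5e-4
  3: 1e-4
  4: 5e-5
  5: 1e-5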

Excellent, I'll try this. Any links or resources you'd recommend for learning how to access the loss curves in tensorboard?

It's quite simple: you probably have tensorboard installed already (otherwise, run pip install tensorboard). Since your config has log_tensorboard: True, we log certain metrics to a file called something like events.out.tfevents....

Just execute tensorboard --logdir <path/to/run/>, which prints a URL where you can follow along with the visualizations of everything we log while training progresses.

Looks something like this (note this is a generic image, not an image from NeuralHydrology):

[generic tensorboard screenshot]

For more details on tensorboard usage, just google it; there are plenty of resources.

Thanks Martin! Lowering the learning rate solved the problem: 350+ epochs with no error. I'll explore these runs with the tensorboard logs after the simulations are done.