mees / calvin

CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

Home Page: http://calvin.cs.uni-freiburg.de


Running into NaN outputs of plan_recognition_net

RamyE opened this issue · comments

I am using all default configs (inputs are the static RGB image and proprioceptive data only). I started getting NaN values that crash the training in epoch 3 (and once in epoch 9). Restarting from the prior epoch never helps; I always have to start training from scratch. I tried debugging this but could not figure it out. Has anyone run into this issue before?

```
ValueError: Expected parameter loc (Tensor of shape (32, 256)) of distribution Normal(loc: torch.Size([32, 256]), scale: torch.Size([32, 256])) to satisfy the constraint Real(), but found invalid values:

  File "/---/calvin/calvin_models/calvin_agent/models/mcil.py", line 278, in training_step
    kl, act_loss, mod_loss, pp_dist, pr_dist = self.lmp_train(
  File "/---/calvin/calvin_models/calvin_agent/models/mcil.py", line 134, in lmp_train
    pr_dist = self.plan_recognition(perceptual_emb)  # (batch, 256) each
  File "/---/calvin/calvin_models/calvin_agent/models/plan_encoders/plan_recognition_net.py", line 58, in __call__
    pr_dist = Independent(Normal(mean, std), 1)
```
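One way to localize this is to guard the tensors just before the `Normal` is constructed, so training fails with a clear message (or can skip the batch) instead of crashing inside the distribution's constraint check. The helper below is a hypothetical sketch, not part of the CALVIN codebase; the `eps` floor on `std` is an assumption:

```python
import torch

def safe_normal_params(mean: torch.Tensor, std: torch.Tensor, eps: float = 1e-6):
    """Hypothetical guard for plan_recognition_net outputs (not in CALVIN).

    Raises a descriptive error if mean/std contain NaN or Inf, and clamps
    std away from zero so log_prob / KL terms stay finite.
    """
    if not torch.isfinite(mean).all():
        raise RuntimeError("plan_recognition mean contains NaN/Inf")
    if not torch.isfinite(std).all():
        raise RuntimeError("plan_recognition std contains NaN/Inf")
    return mean, std.clamp_min(eps)

# e.g. inside __call__, before building the distribution:
# mean, std = safe_normal_params(mean, std)
# pr_dist = Independent(Normal(mean, std), 1)
```

Running with `torch.autograd.set_detect_anomaly(True)` for a few steps can additionally point to the backward op that first produces the NaN, at the cost of slower training.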

hmm, that's weird. What batch size are you using? Are you using the same hyperparams and GPU settings?
We used PyTorch Lightning's DDP implementation to scale our training to 8x NVIDIA GPUs with 12GB memory each. Thus, as each GPU receives a batch of 64 sequences (32 language + 32 vision), the effective batch size is 512 for all our experiments. If you have a different batch size, you might need to tune your learning rate to stabilize your training.
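A common starting point for that tuning is the linear scaling heuristic: scale the learning rate by the ratio of your effective batch size to the authors' 512. A minimal sketch, where `base_lr` is a placeholder value (check the repo's config for the actual default):

```python
# Linear LR scaling heuristic: lr_new = lr_base * (batch_new / batch_base).
base_lr = 2e-4      # placeholder base LR; the real default is in the CALVIN configs
base_batch = 512    # authors' effective batch size: 8 GPUs x 64 sequences
my_batch = 128      # e.g. 2 GPUs x 64 sequences

scaled_lr = base_lr * my_batch / base_batch
print(scaled_lr)
```

This is only a heuristic; with much smaller batches a lower LR (or gradient clipping) may still be needed to keep the plan-recognition outputs finite.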