gpengzhi / Bi-SimCut

Code for NAACL 2022 main conference paper "Bi-SimCut: A Simple Strategy for Boosting Neural Machine Translation"

Gradient overflow encountered while replicating the README experiment

xyb314 opened this issue · comments

Hi there,

I'm quite interested in your work. However, I ran into some issues while replicating the iwslt14_de_en experiment described in the README.

After training for an epoch, I frequently see gradient overflow messages like these:

2023-12-15 19:38:03 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
2023-12-15 19:38:05 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.125
2023-12-15 19:38:05 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0625
...
2023-12-15 19:38:10 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.0001220703125

Subsequently, I encounter this error:

FloatingPointError: Minimum loss scale reached (0.0001). Your loss is probably exploding. Try lowering the learning rate, using gradient clipping, or increasing the batch size.

Even after reducing the learning rate, the issue persists. I'd appreciate any guidance on how to address it.
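
(For reference, the remedies named in that error message correspond to standard fairseq-train options; the sketch below uses placeholder values only, with the rest of the README command assumed unchanged.)

# Illustrative values only, to show which options the error message refers to:
#   --lr           lower the learning rate
#   --clip-norm    clip the gradient norm
#   --max-tokens   raise the per-GPU batch size (in tokens)
#   --update-freq  emulate an even larger batch via gradient accumulation
fairseq-train data-bin/iwslt14_de_en \
    --lr 3e-4 --clip-norm 0.1 --max-tokens 8192 --update-freq 2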

I'm working on a single NVIDIA GeForce RTX 3080 Ti.

Thank you for your work! Looking forward to your response!

Here is the complete traceback:

/home/yzh/anaconda3/envs/bisim/lib/python3.9/site-packages/torch/nn/modules/module.py:1117: UserWarning: Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input. Please use register_full_backward_hook to get the documented behavior.
  warnings.warn("Using a non-full backward hook when the forward contains multiple autograd Nodes "
/home/yzh/anaconda3/envs/bisim/lib/python3.9/site-packages/torch/nn/modules/module.py:1082: UserWarning: Using non-full backward hooks on a Module that does not return a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_output. Please use register_full_backward_hook to get the documented behavior.
  warnings.warn("Using non-full backward hooks on a Module that does not return a "
Traceback (most recent call last):
  File "/sda/yzh/Bi-SimCut/fairseq/fairseq/trainer.py", line 930, in train_step
    grad_norm = self.clip_grad_norm(self.cfg.optimization.clip_norm)
  File "/sda/yzh/Bi-SimCut/fairseq/fairseq/trainer.py", line 1264, in clip_grad_norm
    return self.optimizer.clip_grad_norm(
  File "/sda/yzh/Bi-SimCut/fairseq/fairseq/optim/fp16_optimizer.py", line 201, in clip_grad_norm
    self.scaler.check_overflow(grad_norm)
  File "/sda/yzh/Bi-SimCut/fairseq/fairseq/optim/dynamic_loss_scaler.py", line 61, in check_overflow
    raise FloatingPointError(
FloatingPointError: Minimum loss scale reached (0.0001). Your loss is probably exploding. Try lowering the learning rate, using gradient clipping or increasing the batch size.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/yzh/anaconda3/envs/bisim/bin/fairseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/sda/yzh/Bi-SimCut/fairseq/fairseq_cli/train.py", line 557, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/sda/yzh/Bi-SimCut/fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/sda/yzh/Bi-SimCut/fairseq/fairseq_cli/train.py", line 190, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/home/yzh/anaconda3/envs/bisim/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/sda/yzh/Bi-SimCut/fairseq/fairseq_cli/train.py", line 316, in train
    log_output = trainer.train_step(samples)
  File "/home/yzh/anaconda3/envs/bisim/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/sda/yzh/Bi-SimCut/fairseq/fairseq/trainer.py", line 977, in train_step
    self.task.train_step(
  File "/sda/yzh/Bi-SimCut/fairseq/fairseq/tasks/fairseq_task.py", line 517, in train_step
    loss, sample_size, logging_output = criterion(model, sample)
  File "/home/yzh/anaconda3/envs/bisim/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/sda/yzh/Bi-SimCut/fairseq/fairseq/criterions/label_smoothed_cross_entropy_with_simcut.py", line 96, in forward
    net_output = model(**sample["net_input"])
  File "/home/yzh/anaconda3/envs/bisim/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/sda/yzh/Bi-SimCut/fairseq/fairseq/models/transformer/transformer_base.py", line 144, in forward
    encoder_out = self.encoder(
  File "/home/yzh/anaconda3/envs/bisim/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1227, in _call_impl
    var = next((v for v in var.values() if isinstance(v, torch.Tensor)))
StopIteration

Thanks for your interest in our work.

Did you use the same training configuration as in the README file? It looks like there is some instability in the training procedure. Could you try training without fp16 or increasing --update-freq?
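
For concreteness, here is a schematic sketch of those two adjustments. The data path and the OTHER_README_FLAGS variable are placeholders for whatever the README actually specifies, and 4 is just an example value, not a verified setting:

# Keep every flag from the README's fairseq-train command in OTHER_README_FLAGS,
# except: (a) drop --fp16 so training runs in full fp32 precision, and/or
# (b) raise --update-freq so gradients are accumulated over more mini-batches.
OTHER_README_FLAGS="..."   # placeholder: the remaining README flags, minus --fp16
fairseq-train data-bin/iwslt14_de_en $OTHER_README_FLAGS --update-freq 4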

Thank you for your response.

Yes, I used the parameter settings from the README file, but ran into gradient explosions during both pretraining and fine-tuning. Perhaps it's due to a difference in environments? (I'm not sure; I haven't figured out the cause yet. Reducing the learning rate and increasing the batch size haven't stopped the gradient explosion.) May I ask about your training environment?

Later, I removed the --fp16 flag and successfully trained a bidirectional pretrained model for 300,000 updates (with early stopping). However, this slowed training down, from about 7.5 s per 100 updates to around 18 s per 100 updates.

Additionally, in my recent attempts with smaller --update-freq values (2, 4, 8), this only delayed the gradient explosion rather than preventing it altogether.

I conducted my experiment on a single NVIDIA Tesla V100, and, as far as I recall, my training environment at the time was as follows:

Python version == 3.6.5
PyTorch version == 1.10.1
Fairseq version == 0.12.2

Maybe you could first train an NMT model without fp16 to see whether you can achieve a translation performance similar to what I reported.
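
(If fp16 is kept later, fairseq also exposes a few loss-scaling options that can sometimes ride out occasional overflows; the values below are only examples applied on top of the otherwise unchanged README command, and whether they help in this particular setup is untested.)

# Example values only; see fairseq-train --help for these options.
fairseq-train data-bin/iwslt14_de_en \
    --fp16 --fp16-scale-tolerance 0.25 --fp16-init-scale 8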

Thank you for providing the environment configuration.

Yes, later on I removed the --fp16 flag for fine-tuning. I trained for 300,000 updates, which took approximately 1,000 minutes, and achieved a BLEU score of 37.58 on the test set. Training without --fp16 was slower, so the training wasn't as thorough within the same time frame; nevertheless, the result is quite close to what you report in Table 3 of the paper.
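
(For reference, a typical way to score such a checkpoint with fairseq looks like the command below; the data and checkpoint paths are placeholders, and this may not be the exact evaluation pipeline behind the paper's numbers.)

fairseq-generate data-bin/iwslt14_de_en \
    --path checkpoints/checkpoint_best.pt \
    --gen-subset test --beam 5 --remove-bpe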

Additionally, I had been using pytorch==1.12.0. Today I tried switching to pytorch==1.10.1, and there were no gradient explosions in fp16 mode. It feels quite mysterious, lol.

In any case, I'm really grateful for your response and the information provided!