Zasder3 / train-CLIP

A PyTorch Lightning solution to training OpenAI's CLIP from scratch.

manual_backward + fp16 training doesn't converge

DrJimFan opened this issue

Hi, I borrowed some snippets from your codebase for the distributed GPU and minibatch-within-batch training in my own project. However, I found that training using manual_backward() + FP16 does not converge at all. If I switch to FP32, training works without any other code modifications. I'm using the latest pytorch-lightning v1.6.3. I wonder if you have observed similar issues?
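For context, this is roughly the setup in question — a minimal sketch of Lightning manual optimization, not code from this repo. The `compute_loss` method is a hypothetical placeholder for the CLIP contrastive loss; the optimizer choice and hyperparameters are assumptions. `manual_backward()` is what routes the backward pass through Lightning's precision plugin (loss scaling under fp16) instead of calling `loss.backward()` directly.

```python
import torch
import pytorch_lightning as pl


class ManualCLIPStep(pl.LightningModule):
    """Minimal manual-optimization skeleton (model/loss names are hypothetical)."""

    def __init__(self, model):
        super().__init__()
        self.automatic_optimization = False  # required to use manual_backward()
        self.model = model

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        opt.zero_grad()
        loss = self.model.compute_loss(batch)  # placeholder for the contrastive loss
        # Lets Lightning apply the precision plugin (grad scaling under fp16)
        self.manual_backward(loss)
        opt.step()
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)


# The reported behavior, under Lightning 1.6.x Trainer flags:
# trainer = pl.Trainer(gpus=2, strategy="ddp", precision=16)  # diverges
# trainer = pl.Trainer(gpus=2, strategy="ddp", precision=32)  # converges
```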

I saw something similar, fwiw -- exploding gradients in the gradient rescaling from the very first forward pass. From other threads online, this seems fairly common in transformer architectures, especially ones with parameters smaller in magnitude than the smallest normal 16-bit float (~6.1e-5), which apparently isn't unusual.
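If you want to check whether that's a factor in your model, here's a quick diagnostic sketch (my own, not from this repo) that counts parameter entries below the fp16 normal range. Magnitudes under `torch.finfo(torch.float16).tiny` (~6.1e-5) become subnormal in fp16, and the very smallest ones flush to zero when cast.

```python
import torch


def count_fp16_underflow(model: torch.nn.Module) -> None:
    """Report parameter entries that lose precision when cast to fp16."""
    tiny_normal = torch.finfo(torch.float16).tiny  # ~6.10e-5, smallest normal fp16
    subnormal = flushed = total = 0
    for p in model.parameters():
        a = p.detach().abs()
        total += a.numel()
        nonzero = a > 0
        subnormal += ((a < tiny_normal) & nonzero).sum().item()  # below normal range
        flushed += ((a.half() == 0) & nonzero).sum().item()      # become exactly zero
    print(f"{subnormal}/{total} entries below fp16 normal range, "
          f"{flushed} flush to zero when cast to fp16")
```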