Zasder3 / train-CLIP

A PyTorch Lightning solution to training OpenAI's CLIP from scratch.

manual_backward + fp16 training doesn't converge

DrJimFan opened this issue

Hi, I borrowed some snippets from your codebase for the distributed GPU and minibatch-within-batch training in my own project. However, I found that training using manual_backward() + FP16 does not converge at all. If I switch to FP32, training works without any other code modifications. I'm using the latest pytorch-lightning v1.6.3. I wonder if you have observed similar issues?
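For context, this is roughly the setup in question — a minimal sketch of Lightning manual optimization, not code from this repo. The `compute_loss` method is a hypothetical placeholder for the CLIP contrastive loss; the optimizer choice and hyperparameters are assumptions. `manual_backward()` is what routes the backward pass through Lightning's precision plugin (loss scaling under fp16) instead of calling `loss.backward()` directly.

```python
import torch
import pytorch_lightning as pl


class ManualCLIPStep(pl.LightningModule):
    """Minimal manual-optimization skeleton (model/loss names are hypothetical)."""

    def __init__(self, model):
        super().__init__()
        self.automatic_optimization = False  # required to use manual_backward()
        self.model = model

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        opt.zero_grad()
        loss = self.model.compute_loss(batch)  # placeholder for the contrastive loss
        # Lets Lightning apply the precision plugin (grad scaling under fp16)
        self.manual_backward(loss)
        opt.step()
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)


# The reported behavior, under Lightning 1.6.x Trainer flags:
# trainer = pl.Trainer(gpus=2, strategy="ddp", precision=16)  # diverges
# trainer = pl.Trainer(gpus=2, strategy="ddp", precision=32)  # converges
```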

I saw something similar, fwiw -- exploding gradients in the gradient rescaling from the very first forward pass. From other threads online, this seems fairly common in transformer architectures, especially ones with parameters smaller in magnitude than the smallest normal 16-bit float (~6.1e-5), which apparently isn't unusual.
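If you want to check whether that's a factor in your model, here's a quick diagnostic sketch (my own, not from this repo) that counts parameter entries below the fp16 normal range. Magnitudes under `torch.finfo(torch.float16).tiny` (~6.1e-5) become subnormal in fp16, and the very smallest ones flush to zero when cast.

```python
import torch


def count_fp16_underflow(model: torch.nn.Module) -> None:
    """Report parameter entries that lose precision when cast to fp16."""
    tiny_normal = torch.finfo(torch.float16).tiny  # ~6.10e-5, smallest normal fp16
    subnormal = flushed = total = 0
    for p in model.parameters():
        a = p.detach().abs()
        total += a.numel()
        nonzero = a > 0
        subnormal += ((a < tiny_normal) & nonzero).sum().item()  # below normal range
        flushed += ((a.half() == 0) & nonzero).sum().item()      # become exactly zero
    print(f"{subnormal}/{total} entries below fp16 normal range, "
          f"{flushed} flush to zero when cast to fp16")
```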