scaling up the loss before calculating the gradient
hanit92 opened this issue
Hanit Hakim commented
Hi,
First of all, great paper and great code, thank you for sharing them :)
I was wondering: why do you scale up the loss (multiplying by 1024.) before the backward() call, and then divide the gradients by the same factor again before the weight update?
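For reference, here is a minimal sketch of the pattern I'm asking about, in a generic PyTorch training loop (the names and structure here are mine, not the repo's actual code):

```python
import torch

# Fixed loss scale, as in the question (assumption: 1024. is a constant here).
SCALE = 1024.0

def train_step(model, optimizer, loss_fn, x, y):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    # Scale the loss up before backward() ...
    (loss * SCALE).backward()
    # ... then divide the gradients back down before the weight update,
    # so the effective step size is unchanged.
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(SCALE)
    optimizer.step()
    return loss.item()

# Toy usage:
model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(16, 8), torch.randn(16, 1)
train_step(model, opt, torch.nn.functional.mse_loss, x, y)
```

This looks like the fixed loss-scaling trick used in mixed-precision (fp16) training to keep small gradients from underflowing, but I wanted to confirm whether that's the motivation here.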