microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

[BUG] FP8+PP+Recompute+GA>1, loss = nan

jingjie01ai opened this issue

Describe the bug
With FP8 + pipeline parallelism (PP) + activation recomputation + gradient accumulation (GA) > 1, the loss becomes NaN. The other combinations behave normally:

- FP8 + PP + GA > 1 (no recompute): loss is normal
- FP8 + PP + Recompute + GA = 1: loss is normal
- FP8 + TP (tensor parallelism) + Recompute + GA > 1: loss is normal
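For context, GA > 1 means the derived gradient-accumulation step count exceeds one; Megatron-style trainers derive it from the batch-size arguments rather than taking it directly. A minimal sketch of that relation plus a NaN guard useful for pinpointing the failing step (helper names are hypothetical, not from this repo):

```python
import math

def grad_accum_steps(global_batch: int, micro_batch: int, data_parallel_size: int) -> int:
    # Megatron-style trainers derive gradient-accumulation steps as
    # GA = global_batch / (micro_batch * data_parallel_size).
    per_iter = micro_batch * data_parallel_size
    assert global_batch % per_iter == 0, "global batch must divide evenly"
    return global_batch // per_iter

def check_loss(loss: float, step: int) -> float:
    # Fail fast on NaN so the first bad step is visible in the traceback,
    # instead of NaN silently propagating through later iterations.
    if math.isnan(loss):
        raise FloatingPointError(f"loss became NaN at step {step}")
    return loss
```

With, say, a global batch of 64, micro batch of 2, and data-parallel size of 8, `grad_accum_steps` returns 4, i.e. the GA > 1 regime where the bug reportedly appears.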