microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

[BUG] FP8+PP+Recompute+GA>1, loss = nan

jingjie01ai opened this issue

Describe the bug
With FP8 + pipeline parallelism (PP) + activation recomputation + gradient accumulation (GA) > 1, the loss becomes NaN. The other combinations behave normally:

- FP8 + PP + GA > 1 (no recompute): loss is normal
- FP8 + PP + Recompute + GA = 1: loss is normal
- FP8 + TP (tensor parallelism) + Recompute + GA > 1: loss is normal
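For context, GA > 1 means the derived gradient-accumulation step count exceeds one; Megatron-style trainers derive it from the batch-size arguments rather than taking it directly. A minimal sketch of that relation plus a NaN guard useful for pinpointing the failing step (helper names are hypothetical, not from this repo):

```python
import math

def grad_accum_steps(global_batch: int, micro_batch: int, data_parallel_size: int) -> int:
    # Megatron-style trainers derive gradient-accumulation steps as
    # GA = global_batch / (micro_batch * data_parallel_size).
    per_iter = micro_batch * data_parallel_size
    assert global_batch % per_iter == 0, "global batch must divide evenly"
    return global_batch // per_iter

def check_loss(loss: float, step: int) -> float:
    # Fail fast on NaN so the first bad step is visible in the traceback,
    # instead of NaN silently propagating through later iterations.
    if math.isnan(loss):
        raise FloatingPointError(f"loss became NaN at step {step}")
    return loss
```

With, say, a global batch of 64, micro batch of 2, and data-parallel size of 8, `grad_accum_steps` returns 4, i.e. the GA > 1 regime where the bug reportedly appears.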