DeepSpeed distributed training: loss is NaN or Inf
JohnTang93 opened this issue · comments
Single-node multi-GPU training works fine, but multi-node multi-GPU training fails with this error:
Skipping backward and optimizer step for nan or inf in forwarding metrics/loss!
Try replacing --fp16 with --bf16.
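
For context on why that swap can help: the thread does not confirm the root cause, but a common source of NaN/Inf loss under fp16 is overflow, since fp16 cannot represent finite values above 65504, while bf16 keeps the same exponent range as fp32 at lower precision. Below is a minimal standard-library sketch of that difference. The helpers `to_fp16` and `to_bf16` are hypothetical names for illustration, not DeepSpeed or SwissArmyTransformer APIs, and the bf16 emulation truncates rather than rounds.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; return +/-inf if it overflows."""
    try:
        # "e" is the struct format code for a 16-bit half-precision float.
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values outside the fp16 range.
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the fp32 encoding.

    Assumption: simple truncation; real hardware typically rounds to nearest.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    value = 70000.0  # slightly above the largest finite fp16 value (65504)
    print("fp16:", to_fp16(value))  # overflows to inf
    print("bf16:", to_bf16(value))  # stays finite, with reduced precision
```

If the loss stays finite with --bf16, overflow in fp16 was the likely culprit. If it is still NaN or Inf, the cause is probably elsewhere, for example in the data or in the multi-node setup.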