THUDM / SwissArmyTransformer

SwissArmyTransformer is a flexible and powerful library to develop your own Transformer variants.

Home Page: https://THUDM.github.io/SwissArmyTransformer

deepspeed distributed training: loss nan or inf

JohnTang93 opened this issue

Single-node multi-GPU training works fine, but multi-node multi-GPU training fails with this error:

Skipping backward and optimizer step for nan or inf in forwarding metrics/loss!

Try replacing --fp16 with --bf16.
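
For context, here is a rough sketch of why that switch can help. The thread does not confirm the root cause, so treat this as one plausible explanation: fp16 has a small exponent range (largest finite value 65504), so a large activation or loss can overflow to inf, while bf16 keeps the float32 exponent range at lower precision. The helpers `to_fp16` and `to_bf16` below are hypothetical names for illustration, written in plain Python with only the standard library; they are not part of SwissArmyTransformer or DeepSpeed, and `to_bf16` truncates rather than rounds.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x: float) -> float:
    """Round-trip x through fp16; report overflow as +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range;
        # map that to inf, which is what an fp16 cast would give.
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Approximate bf16 by keeping the top 16 bits of the float32 encoding.

    Assumption: simple truncation, not round-to-nearest.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    value = 70000.0  # slightly above FP16_MAX
    print("fp16:", to_fp16(value))  # overflows
    print("bf16:", to_bf16(value))  # stays finite, with reduced precision
```

If the loss still goes to nan or inf under --bf16, overflow is probably not the only issue, and the multi-node setup itself is worth checking.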