mlfoundations / open_lm

A repository for research on medium-sized language models.

Use no_sync when doing gradient accumulation

achalddave opened this issue

pytorch/pytorch#72446

By default, FSDP reduces gradients on every backward() call, which is slow in multi-node settings when accumulating gradients over several micro-batches. We should use FSDP's no_sync() context manager so gradients are only reduced on the last backward() call of each accumulation cycle.
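
A minimal sketch of what this could look like, assuming an FSDP-wrapped `model`; the `batches`, `loss_fn`, `optimizer`, and `accum_steps` names are hypothetical placeholders, not the repository's actual training loop:

```python
import contextlib

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def train_step(model: FSDP, batches, loss_fn, optimizer, accum_steps: int):
    """Accumulate gradients over `accum_steps` micro-batches, syncing once."""
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(batches):
        is_last = i == accum_steps - 1
        # Under no_sync(), FSDP accumulates gradients locally instead of
        # reducing them; only the final backward() triggers communication.
        sync_ctx = contextlib.nullcontext() if is_last else model.no_sync()
        with sync_ctx:
            loss = loss_fn(model(inputs), targets) / accum_steps
            loss.backward()
    optimizer.step()
```

One trade-off to keep in mind: while inside no_sync(), FSDP holds the accumulated gradients in unsharded form, so this saves communication at the cost of some extra gradient memory per rank.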