huggingface / nanotron

Minimalistic large language model 3D-parallelism training

Merging optimizer states from different pipeline parallel size to resume training

xrsrke opened this issue

Suppose you start training with a pipeline parallel size of 4. We need to support resuming that run with a different pipeline parallel size, such as 2, by merging the optimizer states saved by the original pipeline stages.
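To make the request concrete, here is a minimal sketch of the merge step in plain Python. It is not nanotron's checkpoint format or API. It assumes that each pipeline stage's optimizer state is a dict keyed by globally unique parameter names, that stages hold contiguous slices of the model, and that the new pipeline parallel size evenly divides the old one. The function name `merge_pp_optimizer_states` and the example parameter names are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical layout: parameter name -> optimizer state (e.g. exp_avg, exp_avg_sq, step)
State = Dict[str, Any]


def merge_pp_optimizer_states(shards: List[State], new_pp_size: int) -> List[State]:
    """Regroup per-stage optimizer states into `new_pp_size` contiguous stages.

    `shards[i]` is the state saved by old pipeline rank `i`.
    """
    old_pp_size = len(shards)
    if new_pp_size <= 0 or old_pp_size % new_pp_size != 0:
        raise ValueError("this sketch only handles new sizes that evenly divide the old size")

    stages_per_group = old_pp_size // new_pp_size
    merged: List[State] = []
    for new_rank in range(new_pp_size):
        combined: State = {}
        start = new_rank * stages_per_group
        # Adjacent old stages are folded into one new stage, preserving layer order.
        for old_rank in range(start, start + stages_per_group):
            for name, state in shards[old_rank].items():
                if name in combined:
                    raise ValueError(f"duplicate parameter {name!r} across stages")
                combined[name] = state
        merged.append(combined)
    return merged


# Example: four old stages, one (hypothetical) parameter each, merged into two stages.
old_shards = [{f"layer{i}.weight": {"step": 10}} for i in range(4)]
new_shards = merge_pp_optimizer_states(old_shards, new_pp_size=2)
```

A real checkpoint may index optimizer state by integer ids within parameter groups rather than by name, in which case the merge would also need to remap those ids. Interactions with tensor or data parallel sharding are out of scope for this sketch.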