Merging optimizer states from a different pipeline parallel size to resume training
xrsrke opened this issue
XλRI-U5 commented
Suppose you start training with a pipeline parallel size of 4. We need to support resuming training with a different pipeline parallel size, such as 2, by merging the optimizer states.
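To make the request concrete, here is a minimal sketch of what merging could look like in the 4 to 2 case. It is not the project's implementation: the function name `merge_pp_optimizer_states` is hypothetical, and it assumes that each pipeline stage's optimizer state is a plain dict keyed by a globally unique parameter name, that the old size is a multiple of the new size, and that each new stage owns the parameters of a contiguous group of old stages. Real checkpoints may also carry param groups, step counters, and tensor or data parallel shards, which this sketch ignores.

```python
from typing import Any, Dict, List

# Hypothetical layout: parameter name -> per-parameter optimizer state
# (for Adam this would hold entries such as "exp_avg" and "exp_avg_sq").
StageState = Dict[str, Dict[str, Any]]


def merge_pp_optimizer_states(
    stage_states: List[StageState], new_pp_size: int
) -> List[StageState]:
    """Merge per-stage optimizer states into fewer pipeline stages.

    Assumes consecutive old stages map onto one new stage, e.g. with
    4 -> 2, old stages (0, 1) form new stage 0 and (2, 3) form new stage 1.
    """
    old_pp_size = len(stage_states)
    if new_pp_size <= 0 or old_pp_size % new_pp_size != 0:
        raise ValueError(
            f"cannot merge {old_pp_size} stages into {new_pp_size} stages"
        )

    stages_per_group = old_pp_size // new_pp_size
    merged: List[StageState] = []
    for new_rank in range(new_pp_size):
        combined: StageState = {}
        start = new_rank * stages_per_group
        for old_rank in range(start, start + stages_per_group):
            for name, state in stage_states[old_rank].items():
                # A parameter should live on exactly one pipeline stage.
                if name in combined:
                    raise ValueError(f"duplicate parameter state: {name}")
                combined[name] = state
        merged.append(combined)
    return merged


# Usage: four old stages, one layer each, merged down to two stages.
old_states = [
    {f"layers.{i}.weight": {"exp_avg": [0.1 * i], "exp_avg_sq": [0.01 * i]}}
    for i in range(4)
]
new_states = merge_pp_optimizer_states(old_states, new_pp_size=2)
```

Keying the state by parameter name rather than by positional index is what makes the merge a simple dict union here; if the saved state is indexed by position within each stage, a name mapping would need to be rebuilt first.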