Support for distributed finetuning
Mittagskogel opened this issue
Mittagskogel commented
I would like to restart from the provided checkpoints (https://github.com/microsoft/Megatron-DeepSpeed#downloading-checkpoints) and do distributed finetuning. These checkpoints are single-process and have no arguments saved with them, so I first need to convert them to some parallel format:
- The conversion script described at https://github.com/microsoft/Megatron-DeepSpeed#evaluation-and-tasks is broken: it fails because the checkpoints contain no saved arguments. If I work around that, I run into several more errors caused by improper initialization of Megatron.
- https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing has no support for ZeRO-3.
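For context, here is a minimal sketch of the kind of workaround I have been attempting for the first problem: since the released checkpoints lack a saved `args` object, one can inject a stand-in `argparse.Namespace` with the model's known hyperparameters before handing the checkpoint dict to any loader that expects `checkpoint["args"]`. The `inject_args` helper and the hyperparameter names shown here are hypothetical illustrations, not part of the repo:

```python
from argparse import Namespace

def inject_args(checkpoint: dict, **hparams) -> dict:
    """Attach an argparse.Namespace under the 'args' key if it is missing.

    Hypothetical workaround: loaders that read checkpoint["args"] crash on
    the released checkpoints because that key was never saved.
    """
    if "args" not in checkpoint:
        checkpoint["args"] = Namespace(**hparams)
    return checkpoint

# Stand-in for a real state dict loaded with torch.load(...)
ckpt = {"model": {"weight": [0.0]}}
ckpt = inject_args(
    ckpt,
    num_layers=24,
    hidden_size=1024,
    tensor_model_parallel_size=1,
)
print(ckpt["args"].hidden_size)  # 1024
```

This only papers over the missing-arguments failure; the subsequent Megatron initialization errors remain.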
How should I proceed?