Support for distributed finetuning
Mittagskogel opened this issue
Mittagskogel commented
I would like to restart from the provided checkpoints (https://github.com/microsoft/Megatron-DeepSpeed#downloading-checkpoints) and do distributed finetuning. These checkpoints are single-process and have no arguments saved with them, so I first need to convert them to some parallel format:
- The conversion script described at https://github.com/microsoft/Megatron-DeepSpeed#evaluation-and-tasks is broken: it fails because the checkpoints contain no saved arguments. If I work around that, I run into several more errors caused by improper initialization of Megatron.
- https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing has no support for ZeRO-3.
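For context, here is a minimal sketch of the kind of workaround I have been attempting for the first problem: since the released checkpoints lack a saved `args` object, one can inject a stand-in `argparse.Namespace` with the model's known hyperparameters before handing the checkpoint dict to any loader that expects `checkpoint["args"]`. The `inject_args` helper and the hyperparameter names shown here are hypothetical illustrations, not part of the repo:

```python
from argparse import Namespace

def inject_args(checkpoint: dict, **hparams) -> dict:
    """Attach an argparse.Namespace under the 'args' key if it is missing.

    Hypothetical workaround: loaders that read checkpoint["args"] crash on
    the released checkpoints because that key was never saved.
    """
    if "args" not in checkpoint:
        checkpoint["args"] = Namespace(**hparams)
    return checkpoint

# Stand-in for a real state dict loaded with torch.load(...)
ckpt = {"model": {"weight": [0.0]}}
ckpt = inject_args(
    ckpt,
    num_layers=24,
    hidden_size=1024,
    tensor_model_parallel_size=1,
)
print(ckpt["args"].hidden_size)  # 1024
```

This only papers over the missing-arguments failure; the subsequent Megatron initialization errors remain.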
How should I proceed?