microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Support for distributed finetuning

Mittagskogel opened this issue

I would like to restart from the provided checkpoints (https://github.com/microsoft/Megatron-DeepSpeed#downloading-checkpoints) and do distributed finetuning. These checkpoints are single-process and have no arguments saved with them, so I need to convert them to a model-parallel format first.

How should I proceed?
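
My understanding is that the conversion essentially amounts to splitting each parallelized weight matrix into per-rank shards. Below is a rough sketch of that idea in plain PyTorch, just to illustrate what I mean; the paths, key suffixes, and split dimensions are placeholders, not the actual Megatron-DeepSpeed checkpoint layout or tooling, and it assumes a flat state dict of tensors:

```python
# Hypothetical sketch of splitting a merged checkpoint into tensor-parallel
# shards. Paths, key suffixes, and split dimensions are placeholders; a real
# conversion also has to handle biases, embeddings, optimizer state, and the
# saved args, and the state dict may be nested rather than flat.
import os
import torch

TP_SIZE = 2  # desired tensor-parallel degree for finetuning

# Load the single-process checkpoint (path is a placeholder).
merged = torch.load("release/mp_rank_00/model_optim_rng.pt", map_location="cpu")
state_dict = merged["model"]

COLUMN_PARALLEL = ("dense_h_to_4h.weight",)  # split along the output dim (dim 0)
ROW_PARALLEL = ("dense_4h_to_h.weight",)     # split along the input dim (dim 1)

shards = [dict() for _ in range(TP_SIZE)]
for name, tensor in state_dict.items():
    if name.endswith(COLUMN_PARALLEL):
        pieces = torch.chunk(tensor, TP_SIZE, dim=0)
    elif name.endswith(ROW_PARALLEL):
        pieces = torch.chunk(tensor, TP_SIZE, dim=1)
    else:
        # Replicate non-parallel parameters (layernorms, etc.) on every rank.
        pieces = [tensor] * TP_SIZE
    for rank, piece in enumerate(pieces):
        shards[rank][name] = piece.clone()

# Write one checkpoint per tensor-parallel rank (directory layout assumed).
for rank, shard in enumerate(shards):
    out_dir = f"converted/mp_rank_{rank:02d}"
    os.makedirs(out_dir, exist_ok=True)
    torch.save({"model": shard}, os.path.join(out_dir, "model_optim_rng.pt"))
```

Is there an existing conversion script or a recommended workflow in this repo that does this properly, or do I need to write something like the above myself?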