bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2


Convert DS checkpoint to Transformers

misska1 opened this issue · comments

 python tools/convert_checkpoint/deepspeed_to_megatron.py --target_tp 1 --target_pp 1 --input_folder checkpoints/tr11b-1B3-ml/checkpoints/main/global_step1/ --output_folder ./trans_checkpoints
Convert DeepSpeed Checkpoint to Megatron Checkpoint
args = Namespace(for_release=False, input_folder='checkpoints/tr11b-1B3-ml/checkpoints/main/global_step1/', output_folder='./trans_checkpoints', target_pp=1, target_tp=1)
Converting DeepSpeed checkpoint in checkpoints/tr11b-1B3-ml/checkpoints/main/global_step1/ to Megatron checkpoint in ./trans_checkpoints
Traceback (most recent call last):
  File "tools/convert_checkpoint/deepspeed_to_megatron.py", line 187, in <module>
    main()
  File "tools/convert_checkpoint/deepspeed_to_megatron.py", line 173, in main
    ds_checkpoint = DeepSpeedCheckpoint(args.input_folder, args.target_tp,
  File "/data/anaconda3/envs/ds/lib/python3.8/site-packages/deepspeed/checkpoint/deepspeed_checkpoint.py", line 72, in __init__
    self.zero_checkpoint = ZeROCheckpoint(dir)
  File "/data/anaconda3/envs/ds/lib/python3.8/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 26, in __init__
    assert self.num_files > 0, f'No ZeRO files found in {dir}'
AssertionError: No ZeRO files found in checkpoints/tr11b-1B3-ml/checkpoints/main/global_step1/

I did not get any ZeRO files when saving checkpoints during pretraining.
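
Since the assertion in zero_checkpoint.py fires when no ZeRO state files are found, it may help to first inspect what the checkpoint directory actually contains. The snippet below is a minimal sketch, not part of the original report; the file-naming pattern ("zero" in the name, ending in "optim_states.pt") is an assumption about how DeepSpeed typically names ZeRO shards and may need adjusting.

import os

# Assumed checkpoint path from the command above; adjust as needed.
ckpt_dir = "checkpoints/tr11b-1B3-ml/checkpoints/main/global_step1/"

# Print everything in the folder so it's clear what was actually saved.
for name in sorted(os.listdir(ckpt_dir)):
    print(name)

# Heuristic check for ZeRO optimizer-state shards (naming pattern is an assumption).
zero_files = [
    n for n in os.listdir(ckpt_dir)
    if "zero" in n and n.endswith("optim_states.pt")
]
print(f"ZeRO state files found: {len(zero_files)}")

If that count is zero, the converter has nothing to read, which matches the AssertionError above.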

I get this problem too. How do I solve it?

I ran into the same problem. My guess is that ZERO_STAGE=0 and --fp16 do not work together, so no ZeRO files are generated.
But I don't know how to solve it.
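
If the guess above is right and ZERO_STAGE=0 is the cause, one thing to try is enabling a ZeRO stage in the DeepSpeed config so that optimizer-state shards (the "ZeRO files" the converter looks for) are written at save time. The sketch below only illustrates the "zero_optimization" block; the other values are placeholders, and your training script may instead take the stage from a ZERO_STAGE environment variable, so adapt accordingly.

import json

# Minimal sketch of a DeepSpeed config with ZeRO enabled (stage value and
# surrounding settings are assumptions, not taken from the original setup).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder value
    "zero_optimization": {"stage": 1},     # stage > 0 so ZeRO state files get saved
    "fp16": {"enabled": True},
}

# Write the config where the launch script expects it (path is an assumption).
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

After retraining (or re-saving) a checkpoint with this config, the conversion command above could be rerun against the new global_step folder to see whether the "No ZeRO files found" assertion still triggers.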