microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2


How to convert a DeepSpeed checkpoint to Megatron when pp=2, tp=2, nnodes=2

lonelydancer opened this issue

I trained the model on 2 nodes, then copied machine1's checkpoint files into machine2's directory so that all shards are on one machine.
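Roughly, the copy looks like this (a minimal sketch; the hostname machine1 and the $checkpoint path are placeholders for my setup):

# run on machine2: pull machine1's shard files into the same checkpoint folder,
# so the converter sees every layer_* / mp_rank_* file in one place
# ("machine1" and $checkpoint are placeholders for my actual host/path)
rsync -av machine1:$checkpoint/ $checkpoint/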


Then I run the converter:
python deepspeed_to_megatron.py --input_folder $checkpoint --output_folder output --target_tp 1 --target_pp 1
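To sanity-check the converted checkpoint before loading it, I dump its keys. This is only a quick sketch; it assumes the converter wrote a standard Megatron layout under output/ (the iter_*/mp_rank_00/model_optim_rng.pt file name is an assumption for my run):

# quick sketch: list the state_dict keys of the converted Megatron checkpoint
# (the iter_*/mp_rank_00/model_optim_rng.pt path is assumed, adjust as needed)
python - <<'EOF'
import glob
import torch

# pick the first converted rank file under the output folder (assumed layout)
path = sorted(glob.glob('output/iter_*/mp_rank_00/model_optim_rng.pt'))[0]
sd = torch.load(path, map_location='cpu')
print('top-level keys:', list(sd.keys()))

def dump(d, prefix=''):
    # recursively print tensor keys so missing/extra entries such as
    # final_layernorm.* or layers.24.* are easy to spot
    for k, v in d.items():
        if isinstance(v, dict):
            dump(v, prefix + k + '.')
        elif torch.is_tensor(v):
            print(prefix + k, tuple(v.shape))

dump(sd.get('model', sd))
EOF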
When I then try to load the converted checkpoint for sampling with:
CHECKPOINT_PATH=dataset/checkpoints/gpt2_345m
#CHECKPOINT_PATH=ds_z2_nl24_hs512_gb128_mb8_tiny/
CHECKPOINT_PATH=/workspace/megatron/tools/convert_checkpoint/output
export CUDA_VISIBLE_DEVICES=3
VOCAB_FILE=dataset_tiny/gpt2-vocab.json
MERGE_FILE=dataset_tiny/gpt2-merges.txt

#VOCAB_FILE=data/gpt2-vocab.json
#MERGE_FILE=data/gpt2-merges.txt
export CUDA_DEVICE_MAX_CONNECTIONS=1
export MASTER_ADDR=localhost
export MASTER_PORT=1234
python tools/generate_samples_gpt.py \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 2 \
    --load $CHECKPOINT_PATH \
    --num-layers 24 \
    --hidden-size 512 \
    --num-attention-heads 16 \
    --max-position-embeddings 1024 \
    --fp16 \
    --micro-batch-size 2 \
    --seq-length 1024 \
    --out-seq-length 1024 \
    --temperature 1.0 \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGE_FILE \
    --genfile unconditional_samples.json \
    --num-samples 2 \
    --top_p 0.9
    #--recompute
    #--deepspeed
    #--ds_inference
I get this error:

raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ParallelTransformer:
Missing key(s) in state_dict: "final_layernorm.weight", "final_layernorm.bias".
Unexpected key(s) in state_dict: "layers.24.weight", "layers.24.bias", "final_layernorm.lm_head.weight"
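One thing I am not sure about: the checkpoint was converted with --target_tp 1 --target_pp 1, but the sampling script loads it with --tensor-model-parallel-size 2 --pipeline-model-parallel-size 2. If the converted degrees have to match the loading degrees, the conversion would instead look like this (just a sketch of my guess, not verified):

# sketch: convert with the same parallel degrees that generate_samples_gpt.py uses;
# alternatively, keep target 1/1 and run generation with tp=1, pp=1
python deepspeed_to_megatron.py \
    --input_folder $checkpoint \
    --output_folder output \
    --target_tp 2 \
    --target_pp 2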

I got the same issue. Have you resolved it?