HuangLK / transpeeder

Train LLaMA on a single A100 80G node using 🤗 transformers and 🚀 DeepSpeed pipeline parallelism


File not found error

AlvL1225 opened this issue

commented

Hi Huang, nice work!

When I tried to train a 13B model, I got this error:
[Errno 2] No such file or directory: 'llama_13b_pp/global_step001/zero_pp_rank_0_mp_rank_03_optim_states.pt'

Any ideas on this? The convert2ckpt.py script does not generate files with the zero_pp_... prefix.

Add load_optimizer_states=False and load_lr_scheduler_states=False when loading the checkpoint:

    engine.load_checkpoint(model_args.init_ckpt,
                           load_module_only=True,
                           load_optimizer_states=False,
                           load_lr_scheduler_states=False)

In addition, I modified the checkpoint-loading code so that it skips the zero_pp_xxx files. See this commit.
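
The commit itself isn't quoted here, but the gist is to restore only the module weights when the optimizer shards are absent. A minimal sketch of that idea, assuming DeepSpeed's standard `engine.load_checkpoint` flags (the helper name `load_init_checkpoint` and the glob pattern are illustrative, not taken from the repo):

    import glob
    import os

    def load_init_checkpoint(engine, ckpt_dir, tag=None):
        # Converted checkpoints (e.g. produced by convert2ckpt.py) contain only
        # module weights, so the zero_pp_* optimizer shards are absent.
        has_optim_shards = bool(glob.glob(
            os.path.join(ckpt_dir, "*", "zero_pp_rank_*_optim_states.pt")))
        engine.load_checkpoint(
            ckpt_dir,
            tag=tag,
            load_module_only=not has_optim_shards,
            load_optimizer_states=has_optim_shards,
            load_lr_scheduler_states=has_optim_shards,
        )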

commented

> In addition, I modified the checkpoint-loading code so that it skips the zero_pp_xxx files. See this commit.

Thanks! I have another question. When I tried to use pp4dp2 on an 8×A100 node, I got the following error while initializing the engine:

    RuntimeError: CUDA error: an illegal memory access was encountered
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

Do you have any idea on this?
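
As the message says, the stack trace may point at the wrong call because CUDA errors surface asynchronously. A common way to localize them is to force synchronous kernel launches before CUDA is initialized; a minimal sketch of this (standard PyTorch/CUDA debugging practice, not something specific to this repo):

    import os

    # Must run before the first CUDA call (i.e. at the very top of the training
    # script) so kernel launches become synchronous and the failing call shows
    # up at the right place in the stack trace.
    os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

This slows training down considerably, so it is only meant for debugging.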