HuangLK / transpeeder

Train LLaMA on a single A100 80G node using 🤗 transformers and 🚀 DeepSpeed pipeline parallelism


File not found error

AlvL1225 opened this issue

commented

Hi Huang, nice work!

When I tried to train a 13B model, I got this error:
[Errno 2] No such file or directory: 'llama_13b_pp/global_step001/zero_pp_rank_0_mp_rank_03_optim_states.pt'

Any ideas on this? The convert2ckpt.py script does not generate files with the zero_pp_... prefix.

Add load_optimizer_states=False and load_lr_scheduler_states=False when loading the checkpoint:

    engine.load_checkpoint(model_args.init_ckpt,
                           load_module_only=True,
                           load_optimizer_states=False,
                           load_lr_scheduler_states=False)

In addition, I modified the checkpoint-loading code so that it skips the zero_pp_xxx files. See this commit.
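
The commit itself isn't quoted here, but the gist is to restore only the module weights when the optimizer shards are absent. A minimal sketch of that idea, assuming DeepSpeed's standard `engine.load_checkpoint` flags (the helper name `load_init_checkpoint` and the glob pattern are illustrative, not taken from the repo):

    import glob
    import os

    def load_init_checkpoint(engine, ckpt_dir, tag=None):
        # Converted checkpoints (e.g. produced by convert2ckpt.py) contain only
        # module weights, so the zero_pp_* optimizer shards are absent.
        has_optim_shards = bool(glob.glob(
            os.path.join(ckpt_dir, "*", "zero_pp_rank_*_optim_states.pt")))
        engine.load_checkpoint(
            ckpt_dir,
            tag=tag,
            load_module_only=not has_optim_shards,
            load_optimizer_states=has_optim_shards,
            load_lr_scheduler_states=has_optim_shards,
        )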

commented

> In addition, I modified the checkpoint-loading code so that it skips the zero_pp_xxx files. See this commit.

Thanks! I have another question. When I tried to use pp4dp2 on an 8×A100 node, I got the following error while initializing the engine:

    RuntimeError: CUDA error: an illegal memory access was encountered
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

Do you have any idea on this?
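
As the message says, the stack trace may point at the wrong call because CUDA errors surface asynchronously. A common way to localize them is to force synchronous kernel launches before CUDA is initialized; a minimal sketch of this (standard PyTorch/CUDA debugging practice, not something specific to this repo):

    import os

    # Must run before the first CUDA call (i.e. at the very top of the training
    # script) so kernel launches become synchronous and the failing call shows
    # up at the right place in the stack trace.
    os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

This slows training down considerably, so it is only meant for debugging.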