THUDM / GLM

GLM (General Language Model)

Continuing pre-training from the 10B model fails with an error caused by a world size mismatch

JinmingZhao opened this issue · comments

/data/pretrained_model/glm-10b-chinese-mp4/80000/mp_rank_03_model_states.pt.
Traceback (most recent call last):
File "pretrain_glm.py", line 673, in <module>
main()
File "pretrain_glm.py", line 584, in main
args.iteration = load_checkpoint(model, optimizer, lr_scheduler, args, no_deepspeed=args.no_deepspeed_load)
File "/data/users/zhaojinming/source/glm10BCodesSlurmN1/utils.py", line 339, in load_checkpoint
[2023-03-21 20:28:05,674] [INFO] [torch_checkpoint_engine.py:21:load] [Torch] Loading checkpoint from /home/zhaojinming/data/pretrained_model/glm-10b-chinese-mp4/80000/mp_rank_00_model_states.pt...
load_lr_scheduler_states=not args.no_load_lr_scheduler)
File "/home/zhaojinming/data/miniconda3/envs/glm10b/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 2780, in load_checkpoint
load_optimizer_states=load_optimizer_states)
File "/home/zhaojinming/data/miniconda3/envs/glm10b/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 2950, in _load_zero_checkpoint
raise ZeRORuntimeException("The checkpoint being loaded used a DP "
deepspeed.runtime.zero.utils.ZeRORuntimeException: The checkpoint being loaded used a DP world size of 256 but the current world size is 2. Automatic adjustment of ZeRO's optimizer state partitioning with a new world size is not currently supported.
g2001:34292:39682 [7] NCCL INFO [Service thread] Connection closed by localRank 7
g2001:34292:34292 [7] NCCL INFO comm 0x56092c1a6180 rank 7 nranks 8 cudaDev 7 busId c9000 - Abort COMPLETE
g2001:34292:34952 [7] NCCL INFO [Service thread] Connection closed by localRank 3
g2001:34292:34292 [7] NCCL INFO comm 0x560915b1e7b0 rank 3 nranks 4 cudaDev 7 busId c9000 - Abort COMPLETE
[2023-03-21 20:28:07,098] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 34285
[2023-03-21 20:28:07,370] [INFO] [torch_checkpoint_engine.py:23:load] [Torch] Loaded checkpoint from /home/zhaojinming/data/pretrained_model/glm-10b-chinese-mp4/80000/mp_rank_00_model_states.pt.
[2023-03-21 20:28:02,531] [INFO] [torch_checkpoint_engine.py:21:load] [Torch] Loading checkpoint from /home/zhaojinming/data/pretrained_model/glm-10b-chinese-mp4/80000/mp_rank_03_model_states.pt...
[2023-03-21 20:28:02,724] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 34286
[2023-03-21 20:28:03,040] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 34287
[2023-03-21 20:27:58,893] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 34288
[2023-03-21 20:27:59,111] [INFO] [torch_checkpoint_engine.py:21:load] [Torch] Loading checkpoint from /home/zhaojinming/data/pretrained_model/glm-10b-chinese-mp4/80000/mp_rank_00_model_states.pt...
[2023-03-21 20:27:59,225] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 34289
[2023-03-21 20:27:59,583] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 34290
[2023-03-21 20:27:59,996] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 34291
[2023-03-21 20:28:00,875] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 34292
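
The two numbers in the exception can be checked with a little arithmetic. This is a minimal sketch, not code from GLM or DeepSpeed. It assumes the data-parallel (DP) world size is the total number of ranks divided by the model-parallel size. The values 8 and 4 are read off the log above (`nranks 8`, and the `mp4` checkpoint directory holding `mp_rank_00` through `mp_rank_03`). The helper name `data_parallel_world_size` is hypothetical.

```python
def data_parallel_world_size(total_ranks: int, model_parallel_size: int) -> int:
    """Return the number of data-parallel replicas for a given launch.

    Assumption: every model-parallel group holds one replica, so the
    DP world size is total_ranks / model_parallel_size.
    """
    if model_parallel_size <= 0:
        raise ValueError("model_parallel_size must be positive")
    if total_ranks % model_parallel_size != 0:
        raise ValueError("total_ranks must be divisible by model_parallel_size")
    return total_ranks // model_parallel_size


# Values taken from the log: 8 ranks on the node, model-parallel size 4.
current_dp = data_parallel_world_size(total_ranks=8, model_parallel_size=4)

# Value reported by the exception for the saved checkpoint.
checkpoint_dp = 256

print(f"current DP world size:    {current_dp}")
print(f"checkpoint DP world size: {checkpoint_dp}")
print(f"match: {current_dp == checkpoint_dp}")
```

Under these assumptions the current run has a DP world size of 2, which agrees with the number in the exception. The checkpoint's ZeRO optimizer state was partitioned across 256 data-parallel ranks, and the message says DeepSpeed does not re-partition it automatically for a different world size.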

Same error here. Has this been solved? @JinmingZhao

May I ask what format your pre-training corpus is in?