THUDM / GLM

GLM (General Language Model)

Continued pretraining: loading the split 2B model fails with zero_pp_rank_0_mp_rank_00_optim_states.pt not found

shuangt opened this issue

Several errors, and I'm not sure what's going on:

1. When preparing data, the run hangs in `data_utils/__init__.py` at `torch.distributed.barrier()`; commenting out that line lets it proceed (see the first sketch after the log below).
2. When loading the model for continued pretraining, even with `--no-load-optim` set, it still fails: file not found `zero_pp_rank_0_mp_rank_00_optim_states.pt` (second sketch below).
3. Without loading a pretrained model at all, pretraining hangs at the `iteration, skipped = train(model, optimizer, ...)` call in `pretrain_glm.py`; there was no progress all night, and the console finally stopped here (third sketch below):
```
172.16.10.11: [2023-04-13 21:10:08,253] [INFO] [checkpointing.py:553:forward] Activation Checkpointing Information
172.16.10.11: [2023-04-13 21:10:08,254] [INFO] [checkpointing.py:554:forward] ----Partition Activations False, CPU CHECKPOINTING False
172.16.10.11: [2023-04-13 21:10:08,254] [INFO] [checkpointing.py:557:forward] ----contiguous Memory Checkpointing False with 36 total layers
172.16.10.11: [2023-04-13 21:10:08,254] [INFO] [checkpointing.py:560:forward] ----Synchronization False
172.16.10.11: [2023-04-13 21:10:08,254] [INFO] [checkpointing.py:561:forward] ----Profiling time in checkpointing False
```
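
For error 1, here is a minimal, hypothetical sketch of the usual rank-0-prepares / others-wait pattern around `torch.distributed.barrier()`; the helper names are placeholders, not GLM's actual functions. A barrier hangs whenever at least one rank in the process group never reaches it (e.g. a WORLD_SIZE/launch mismatch or a rank that died earlier), so commenting it out hides the symptom but can let ranks read a half-written cache:

```python
import torch.distributed as dist

def build_dataset_cache():
    pass  # placeholder for the expensive preprocessing step

def load_dataset_cache():
    pass  # placeholder: all ranks read the cache written by rank 0

def prepare_data():
    # Only rank 0 builds the on-disk cache; everyone else waits at the barrier.
    if dist.get_rank() == 0:
        build_dataset_cache()
    # Every rank in the process group must reach this call; if any rank exits
    # early, crashes silently, or was never launched, the rest block here forever.
    dist.barrier()
    load_dataset_cache()

if __name__ == "__main__":
    # e.g. torchrun --nproc_per_node=2 barrier_sketch.py
    dist.init_process_group(backend="gloo")  # gloo keeps the sketch CPU-only
    prepare_data()
    dist.destroy_process_group()
```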
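
For error 2, one possible workaround at the DeepSpeed level, assuming the checkpoint was saved by a DeepSpeed engine: `load_checkpoint` can be asked to restore module weights only, so the missing `zero_pp_rank_0_mp_rank_00_optim_states.pt` partition is never requested. The model, config, and paths below are placeholders, not GLM's training code:

```python
import torch
import deepspeed

# Placeholder model standing in for GLM; the point is only the checkpoint call.
model = torch.nn.Linear(16, 16)

ds_config = {
    "train_batch_size": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 1},
}

# Launched via the deepspeed launcher, e.g.: deepspeed load_sketch.py
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# load_module_only=True restores module weights and skips optimizer /
# lr-scheduler state, so the ZeRO *_optim_states.pt files are never read
# and the optimizer starts from freshly initialized state.
engine.load_checkpoint(
    "/path/to/checkpoint",  # placeholder checkpoint directory
    load_optimizer_states=False,
    load_lr_scheduler_states=False,
    load_module_only=True,
)
```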
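
For error 3, a hedged debugging sketch: giving the process group a finite timeout, together with NCCL's async error handling (whose environment-variable name varies across PyTorch versions; the spelling below is the pre-2.2 one), turns an all-night hang into a loud failure that identifies the stuck collective:

```python
import datetime
import os

import torch.distributed as dist

# Surface NCCL failures instead of blocking silently (pre-PyTorch-2.2 name).
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")
os.environ.setdefault("NCCL_DEBUG", "INFO")  # per-collective NCCL logging

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=30),  # abort a stuck collective
)
```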