THUDM / GLM-130B

GLM-130B: An Open Bilingual Pre-Trained Model (ICLR 2023)

4 × 4090 GPUs: int4 model inference error

sukibean163 opened this issue · comments

Hello, when I run quantized inference on 4 × 4090 GPUs, I get an error. What could be the cause? The quantized model directory and the contents of model_glm_130b_4090_int4.sh are as follows:

int4 model and script

Model

$ tree THUDM/chatglm-130b-int4/
THUDM/chatglm-130b-int4/
|-- 49300
|   |-- mp_rank_00_model_states.pt
|   |-- mp_rank_01_model_states.pt
|   |-- mp_rank_02_model_states.pt
|   `-- mp_rank_03_model_states.pt
`-- latest

1 directory, 5 files

The model_glm_130b_4090_int4.sh script

MODEL_TYPE="glm-130b"
CHECKPOINT_PATH="THUDM/chatglm-130b-int4"
MP_SIZE=4
MODEL_ARGS="--model-parallel-size ${MP_SIZE}
--num-layers 70
--hidden-size 12288
--inner-hidden-size 32768
--vocab-size 150528
--num-attention-heads 96
--max-sequence-length 2048
--tokenizer-type icetk-glm-130B
--layernorm-order post
--quantization-bit-width 4
--load ${CHECKPOINT_PATH}
--skip-init
--fp16
--bminf
--from-quantized-checkpoint
--bminf-memory-limit 24"
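For context, a back-of-the-envelope memory estimate (my own arithmetic, not from the repository), using the parameter count from the model name (130B) and the quantization and parallelism settings from the script above:

```shell
# Rough memory estimate (my own arithmetic, not from the repo):
# GLM-130B has ~130B parameters; at 4-bit quantization each weight
# takes half a byte, and model parallelism splits them across cards.
PARAMS_B=130        # parameters, in billions
BITS=4              # matches --quantization-bit-width 4
MP_SIZE=4           # matches --model-parallel-size 4

TOTAL_GB=$(( PARAMS_B * BITS / 8 ))   # ~65 GB of int4 weights in total
PER_GPU_GB=$(( TOTAL_GB / MP_SIZE ))  # ~16 GB per card

echo "total weights: ${TOTAL_GB} GB, per GPU: ${PER_GPU_GB} GB (limit: 24 GB)"
```

So the quantized weights themselves should fit within the 24 GB of a 4090 (and the `--bminf-memory-limit 24` above), which suggests the failure below is not simple GPU out-of-memory; SIGBUS is typically a host-side memory fault (for example, too little shared memory in a container), though that is only a guess.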

Error

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 9587) of binary: /home/sukibean/bin/python
Traceback (most recent call last):
File "/home/sukibean/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/sukibean/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/sukibean/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/sukibean/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/sukibean/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/sukibean/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/sukibean/work/src/python/GLM-130B-main/generate.py FAILED

Failures:
[1]:
time : 2023-06-09_03:52:08
host : 3046826faa76
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 9588)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 9588
[2]:
time : 2023-06-09_03:52:08
host : 3046826faa76
rank : 2 (local_rank: 2)
exitcode : -7 (pid: 9589)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 9589
[3]:
time : 2023-06-09_03:52:08
host : 3046826faa76
rank : 3 (local_rank: 3)
exitcode : -7 (pid: 9590)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 9590

Root Cause (first observed failure):
[0]:
time : 2023-06-09_03:52:08
host : 3046826faa76
rank : 0 (local_rank: 0)
exitcode : -7 (pid: 9587)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 9587

commented

I'm running into the same problem. Any help? It seems to be caused by the multi-GPU setup?