THUDM / GLM-130B

GLM-130B: An Open Bilingual Pre-Trained Model (ICLR 2023)

4 × 4090 GPUs: int4 model inference error

sukibean163 opened this issue · comments

Hello, when I run quantized inference on 4 × 4090 GPUs, I get an error. What could be the cause? The quantized model directory and the contents of model_glm_130b_4090_int4.sh are as follows:

int4 model and script

Model

$ tree THUDM/chatglm-130b-int4/
THUDM/chatglm-130b-int4/
|-- 49300
|   |-- mp_rank_00_model_states.pt
|   |-- mp_rank_01_model_states.pt
|   |-- mp_rank_02_model_states.pt
|   `-- mp_rank_03_model_states.pt
`-- latest

1 directory, 5 files

The model_glm_130b_4090_int4.sh script

MODEL_TYPE="glm-130b"
CHECKPOINT_PATH="THUDM/chatglm-130b-int4"
MP_SIZE=4
MODEL_ARGS="--model-parallel-size ${MP_SIZE}
--num-layers 70
--hidden-size 12288
--inner-hidden-size 32768
--vocab-size 150528
--num-attention-heads 96
--max-sequence-length 2048
--tokenizer-type icetk-glm-130B
--layernorm-order post
--quantization-bit-width 4
--load ${CHECKPOINT_PATH}
--skip-init
--fp16
--bminf
--from-quantized-checkpoint
--bminf-memory-limit 24"
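For context, a back-of-the-envelope memory estimate (my own arithmetic, not from the repository), using the parameter count from the model name (130B) and the quantization and parallelism settings from the script above:

```shell
# Rough memory estimate (my own arithmetic, not from the repo):
# GLM-130B has ~130B parameters; at 4-bit quantization each weight
# takes half a byte, and model parallelism splits them across cards.
PARAMS_B=130        # parameters, in billions
BITS=4              # matches --quantization-bit-width 4
MP_SIZE=4           # matches --model-parallel-size 4

TOTAL_GB=$(( PARAMS_B * BITS / 8 ))   # ~65 GB of int4 weights in total
PER_GPU_GB=$(( TOTAL_GB / MP_SIZE ))  # ~16 GB per card

echo "total weights: ${TOTAL_GB} GB, per GPU: ${PER_GPU_GB} GB (limit: 24 GB)"
```

So the quantized weights themselves should fit within the 24 GB of a 4090 (and the `--bminf-memory-limit 24` above), which suggests the failure below is not simple GPU out-of-memory; SIGBUS is typically a host-side memory fault (for example, too little shared memory in a container), though that is only a guess.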

Error

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 9587) of binary: /home/sukibean/bin/python
Traceback (most recent call last):
File "/home/sukibean/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/sukibean/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/sukibean/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/sukibean/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/sukibean/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/sukibean/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/sukibean/work/src/python/GLM-130B-main/generate.py FAILED

Failures:
[1]:
time : 2023-06-09_03:52:08
host : 3046826faa76
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 9588)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 9588
[2]:
time : 2023-06-09_03:52:08
host : 3046826faa76
rank : 2 (local_rank: 2)
exitcode : -7 (pid: 9589)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 9589
[3]:
time : 2023-06-09_03:52:08
host : 3046826faa76
rank : 3 (local_rank: 3)
exitcode : -7 (pid: 9590)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 9590

Root Cause (first observed failure):
[0]:
time : 2023-06-09_03:52:08
host : 3046826faa76
rank : 0 (local_rank: 0)
exitcode : -7 (pid: 9587)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 9587

commented

I'm running into the same problem. Any help? It seems to be caused by the multi-GPU setup?