liangwq / Chatglm_lora_multi-gpu

Multi-GPU ChatGLM with DeepSpeed and ...

The model errors out partway through loading the checkpoint. Could someone help me take a look?

WXD7 opened this issue · comments

commented

GLM) wxd7@wxd7-EG341W-G21:~/glm/Chatglm_lora_multi-gpu-main$ torchrun --nproc_per_node=2 multi_gpu_fintune_belle.py --dataset_path /home/wxd7/glm/ChatGLM-Tuning-master/data/alpaca --model_path /home/wxd7/upan/GLM/model/chatglm-6b --lora_rank 8 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --save_steps 2000 --save_total_limit 2 --learning_rate 2e-5 --fp16 --num_train_epochs 2 --remove_unused_columns false --logging_steps 50 --report_to wandb --output_dir output --deepspeed ds_config_zero3.json
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

CUDA SETUP: CUDA runtime path found: /home/wxd7/anaconda3/envs/GLM/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: CUDA runtime path found: /home/wxd7/anaconda3/envs/GLM/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 113
/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so...
/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so...
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Traceback (most recent call last):
File "", line 1, in
FileNotFoundError: [Errno 2] No such file or directory: '/home/wxd7/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/aa51e62ddc9c9f334858b0af44cf59b05c70148a/tokenization_chatglm.py'
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Loading checkpoint shards: 38%|██████▊ | 3/8 [00:08<00:13, 2.64s/it]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17785 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 17786) of binary: /home/wxd7/anaconda3/envs/GLM/bin/python
Traceback (most recent call last):
File "/home/wxd7/anaconda3/envs/GLM/bin/torchrun", line 8, in
sys.exit(main())
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

multi_gpu_fintune_belle.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-04-13_02:02:27
host : wxd7-EG341W-G21
rank : 1 (local_rank: 1)
exitcode : -9 (pid: 17786)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 17786

The machine has 48 GB of RAM and two P40s (24 GB each). It can train on a single GPU, but with two GPUs it errors out halfway through loading the model.

You are loading the GLM model via AutoModel. The model repo contains this modeling_chatglm.py file; try putting that file into your model directory.
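
A minimal sketch of one way to read this suggestion, assuming the weights live in the --model_path directory from the command above and that the custom-code files can be fetched from the THUDM/chatglm-6b repo (the exact file list is an assumption, not something confirmed in this thread):

    # Hypothetical sketch: place the ChatGLM custom-code files next to the local
    # weights so AutoTokenizer/AutoModel never resolve them through the
    # (incomplete) Hugging Face cache that raised the FileNotFoundError above.
    import shutil
    from huggingface_hub import hf_hub_download
    from transformers import AutoTokenizer, AutoModel

    local_model_dir = "/home/wxd7/upan/GLM/model/chatglm-6b"  # the --model_path used above

    # File list is an assumption based on the error messages in this thread.
    for fname in ("modeling_chatglm.py", "tokenization_chatglm.py", "configuration_chatglm.py"):
        cached = hf_hub_download(repo_id="THUDM/chatglm-6b", filename=fname)
        shutil.copy(cached, f"{local_model_dir}/{fname}")

    # Loading from the local directory should then pick up the copied files.
    tokenizer = AutoTokenizer.from_pretrained(local_model_dir, trust_remote_code=True)
    model = AutoModel.from_pretrained(local_model_dir, trust_remote_code=True)

The training script itself may load the model differently; this only illustrates where the file is expected to end up.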

commented

Thanks for the guidance :)

I don't quite understand what "the modeling_chatglm.py inside the model, try putting that file in" means. Do you mean appending modeling_chatglm.py after the command-line arguments? That also throws an error.

Sorry, I'm a bit of a beginner.

The run output is as follows:
(GLM) wxd7@wxd7-EG341W-G21:~/glm/Chatglm_lora_multi-gpu-main$ torchrun --nproc_per_node=2 multi_gpu_fintune_belle.py --dataset_path /home/wxd7/glm/ChatGLM-Tuning-master/data/alpaca --lora_rank 8 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --save_steps 2000 --save_total_limit 2 --learning_rate 2e-5 --fp16 --num_train_epochs 2 --remove_unused_columns false --logging_steps 50 --report_to wandb --output_dir output --deepspeed ds_config_zero3.json modeling_chatglm.py
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

CUDA SETUP: CUDA runtime path found: /home/wxd7/anaconda3/envs/GLM/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: CUDA runtime path found: /home/wxd7/anaconda3/envs/GLM/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 113
/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so...
/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so...
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Traceback (most recent call last):
File "/home/wxd7/glm/Chatglm_lora_multi-gpu-main/multi_gpu_fintune_belle.py", line 360, in
main()
File "/home/wxd7/glm/Chatglm_lora_multi-gpu-main/multi_gpu_fintune_belle.py", line 211, in main
).parse_args_into_dataclasses()
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/transformers/hf_argparser.py", line 341, in parse_args_into_dataclasses
raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--modeling_chatglm.py']
Traceback (most recent call last):
File "/home/wxd7/glm/Chatglm_lora_multi-gpu-main/multi_gpu_fintune_belle.py", line 360, in
main()
File "/home/wxd7/glm/Chatglm_lora_multi-gpu-main/multi_gpu_fintune_belle.py", line 211, in main
).parse_args_into_dataclasses()
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/transformers/hf_argparser.py", line 341, in parse_args_into_dataclasses
raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--modeling_chatglm.py']
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 34807) of binary: /home/wxd7/anaconda3/envs/GLM/bin/python
Traceback (most recent call last):
File "/home/wxd7/anaconda3/envs/GLM/bin/torchrun", line 8, in
sys.exit(main())
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

multi_gpu_fintune_belle.py FAILED

Failures:
[1]:
time : 2023-04-13_12:11:39
host : wxd7-EG341W-G21
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 34808)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-04-13_12:11:39
host : wxd7-EG341W-G21
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 34807)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
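
For what it's worth, this second failure is HfArgumentParser rejecting the extra token: it only consumes flags declared on the script's dataclasses, so a bare modeling_chatglm.py left on the command line raises the ValueError shown above. A small sketch of that behaviour (the dataclass fields here are placeholders, not the real definition in multi_gpu_fintune_belle.py):

    from dataclasses import dataclass, field
    from transformers import HfArgumentParser

    @dataclass
    class FinetuneArguments:  # placeholder fields, not the script's actual arguments
        dataset_path: str = field(default="data/alpaca")
        model_path: str = field(default="THUDM/chatglm-6b")
        lora_rank: int = field(default=8)

    parser = HfArgumentParser(FinetuneArguments)
    try:
        # A leftover token that matches no declared flag triggers the error seen above.
        parser.parse_args_into_dataclasses(["--lora_rank", "8", "modeling_chatglm.py"])
    except ValueError as err:
        print(err)  # Some specified arguments are not used by the HfArgumentParser: ['modeling_chatglm.py']

So the file needs to be copied into the model directory (as in the earlier suggestion), not appended to the torchrun command.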