liangwq / Chatglm_lora_multi-gpu

Multi-GPU ChatGLM with DeepSpeed and ...

The model errors out partway through loading the checkpoint. Could someone help me take a look?

WXD7 opened this issue · comments

commented

GLM) wxd7@wxd7-EG341W-G21:~/glm/Chatglm_lora_multi-gpu-main$ torchrun --nproc_per_node=2 multi_gpu_fintune_belle.py --dataset_path /home/wxd7/glm/ChatGLM-Tuning-master/data/alpaca --model_path /home/wxd7/upan/GLM/model/chatglm-6b --lora_rank 8 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --save_steps 2000 --save_total_limit 2 --learning_rate 2e-5 --fp16 --num_train_epochs 2 --remove_unused_columns false --logging_steps 50 --report_to wandb --output_dir output --deepspeed ds_config_zero3.json
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

CUDA SETUP: CUDA runtime path found: /home/wxd7/anaconda3/envs/GLM/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: CUDA runtime path found: /home/wxd7/anaconda3/envs/GLM/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 113
/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so...
/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so...
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Traceback (most recent call last):
File "", line 1, in
FileNotFoundError: [Errno 2] No such file or directory: '/home/wxd7/.cache/huggingface/modules/transformers_modules/THUDM/chatglm-6b/aa51e62ddc9c9f334858b0af44cf59b05c70148a/tokenization_chatglm.py'
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Loading checkpoint shards: 38%|██████▊ | 3/8 [00:08<00:13, 2.64s/it]WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17785 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 17786) of binary: /home/wxd7/anaconda3/envs/GLM/bin/python
Traceback (most recent call last):
File "/home/wxd7/anaconda3/envs/GLM/bin/torchrun", line 8, in
sys.exit(main())
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

multi_gpu_fintune_belle.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-04-13_02:02:27
host : wxd7-EG341W-G21
rank : 1 (local_rank: 1)
exitcode : -9 (pid: 17786)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 17786

The machine has 48 GB of RAM and two P40s (24 GB each). It can train on a single GPU, but with two GPUs it errors out halfway through loading the model.

You are loading the GLM model via AutoModel. The model repo contains this modeling_chatglm.py file; try putting that file into your model directory.
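
A minimal sketch of one way to read this suggestion, assuming the weights live in the --model_path directory from the command above and that the custom-code files can be fetched from the THUDM/chatglm-6b repo (the exact file list is an assumption, not something confirmed in this thread):

    # Hypothetical sketch: place the ChatGLM custom-code files next to the local
    # weights so AutoTokenizer/AutoModel never resolve them through the
    # (incomplete) Hugging Face cache that raised the FileNotFoundError above.
    import shutil
    from huggingface_hub import hf_hub_download
    from transformers import AutoTokenizer, AutoModel

    local_model_dir = "/home/wxd7/upan/GLM/model/chatglm-6b"  # the --model_path used above

    # File list is an assumption based on the error messages in this thread.
    for fname in ("modeling_chatglm.py", "tokenization_chatglm.py", "configuration_chatglm.py"):
        cached = hf_hub_download(repo_id="THUDM/chatglm-6b", filename=fname)
        shutil.copy(cached, f"{local_model_dir}/{fname}")

    # Loading from the local directory should then pick up the copied files.
    tokenizer = AutoTokenizer.from_pretrained(local_model_dir, trust_remote_code=True)
    model = AutoModel.from_pretrained(local_model_dir, trust_remote_code=True)

The training script itself may load the model differently; this only illustrates where the file is expected to end up.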

commented

Thanks for the guidance :)

I don't quite understand what "the modeling_chatglm.py inside the model, try putting that file in" means. Do you mean appending modeling_chatglm.py after the command-line arguments? That also throws an error.

Sorry, I'm a bit of a beginner.

The run output is as follows:
(GLM) wxd7@wxd7-EG341W-G21:~/glm/Chatglm_lora_multi-gpu-main$ torchrun --nproc_per_node=2 multi_gpu_fintune_belle.py --dataset_path /home/wxd7/glm/ChatGLM-Tuning-master/data/alpaca --lora_rank 8 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --save_steps 2000 --save_total_limit 2 --learning_rate 2e-5 --fp16 --num_train_epochs 2 --remove_unused_columns false --logging_steps 50 --report_to wandb --output_dir output --deepspeed ds_config_zero3.json modeling_chatglm.py
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

CUDA SETUP: CUDA runtime path found: /home/wxd7/anaconda3/envs/GLM/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: CUDA runtime path found: /home/wxd7/anaconda3/envs/GLM/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 113
/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so...
/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so...
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Traceback (most recent call last):
File "/home/wxd7/glm/Chatglm_lora_multi-gpu-main/multi_gpu_fintune_belle.py", line 360, in
main()
File "/home/wxd7/glm/Chatglm_lora_multi-gpu-main/multi_gpu_fintune_belle.py", line 211, in main
).parse_args_into_dataclasses()
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/transformers/hf_argparser.py", line 341, in parse_args_into_dataclasses
raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--modeling_chatglm.py']
Traceback (most recent call last):
File "/home/wxd7/glm/Chatglm_lora_multi-gpu-main/multi_gpu_fintune_belle.py", line 360, in
main()
File "/home/wxd7/glm/Chatglm_lora_multi-gpu-main/multi_gpu_fintune_belle.py", line 211, in main
).parse_args_into_dataclasses()
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/transformers/hf_argparser.py", line 341, in parse_args_into_dataclasses
raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--modeling_chatglm.py']
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 34807) of binary: /home/wxd7/anaconda3/envs/GLM/bin/python
Traceback (most recent call last):
File "/home/wxd7/anaconda3/envs/GLM/bin/torchrun", line 8, in
sys.exit(main())
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

multi_gpu_fintune_belle.py FAILED

Failures:
[1]:
time : 2023-04-13_12:11:39
host : wxd7-EG341W-G21
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 34808)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-04-13_12:11:39
host : wxd7-EG341W-G21
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 34807)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
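
For what it's worth, this second failure is HfArgumentParser rejecting the extra token: it only consumes flags declared on the script's dataclasses, so a bare modeling_chatglm.py left on the command line raises the ValueError shown above. A small sketch of that behaviour (the dataclass fields here are placeholders, not the real definition in multi_gpu_fintune_belle.py):

    from dataclasses import dataclass, field
    from transformers import HfArgumentParser

    @dataclass
    class FinetuneArguments:  # placeholder fields, not the script's actual arguments
        dataset_path: str = field(default="data/alpaca")
        model_path: str = field(default="THUDM/chatglm-6b")
        lora_rank: int = field(default=8)

    parser = HfArgumentParser(FinetuneArguments)
    try:
        # A leftover token that matches no declared flag triggers the error seen above.
        parser.parse_args_into_dataclasses(["--lora_rank", "8", "modeling_chatglm.py"])
    except ValueError as err:
        print(err)  # Some specified arguments are not used by the HfArgumentParser: ['modeling_chatglm.py']

So the file needs to be copied into the model directory (as in the earlier suggestion), not appended to the torchrun command.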