liucongg / ChatGLM-Finetuning

Fine-tuning for specific downstream tasks based on the ChatGLM-6B, ChatGLM2-6B, and ChatGLM3-6B models, covering Freeze, LoRA, P-tuning, full-parameter fine-tuning, and more

ChatGLM3 training on four GPUs fails with an error

eanfs opened this issue · comments

[2024-02-04 17:56:47,007] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py311_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu116 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py311_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Traceback (most recent call last):
File "/home/workspace/ChatGLM-Finetuning/train.py", line 234, in <module>
main()
File "/home/workspace/ChatGLM-Finetuning/train.py", line 178, in main
model, optimizer, _, lr_scheduler = deepspeed.initialize(model=model, args=args, config=ds_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/__init__.py", line 171, in initialize
engine = DeepSpeedEngine(args=args,
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1186, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1263, in _configure_basic_optimizer
optimizer = FusedAdam(
^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 446, in load
return self.jit_load(verbose)
^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 489, in jit_load
op_module = load(name=self.name,
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1534, in _jit_compile
return _import_module_from_library(name, build_directory, is_python_module)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/chatglm/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1936, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen importlib._bootstrap>", line 573, in module_from_spec
File "<frozen importlib._bootstrap_external>", line 1233, in create_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /root/.cache/torch_extensions/py311_cu116/fused_adam/fused_adam.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
[The other ranks fail with the same traceback, ending in the same ImportError for fused_adam.so; their output was interleaved across processes and is omitted here.]
[2024-02-04 17:56:50,782] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 30665
[2024-02-04 17:56:50,797] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 30666
[2024-02-04 17:56:50,807] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 30667
[2024-02-04 17:56:50,817] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 30668
[2024-02-04 17:56:50,818] [ERROR] [launch.py:321:sigkill_handler] ['/root/miniconda3/envs/chatglm/bin/python', '-u', 'train.py', '--local_rank=3', '--train_path', 'data/d2q_0.json', '--model_name_or_path', 'chatglm3-6b/', '--per_device_train_batch_size', '1', '--max_len', '1560', '--max_src_len', '1024', '--learning_rate', '1e-4', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '4', '--warmup_ratio', '0.1', '--mode', 'glm3', '--train_type', 'lora', '--freeze_module_name', 'layers.27.,layers.26.,layers.25.,layers.24.', '--seed', '1234', '--ds_file', 'ds_zero2_no_offload.json', '--gradient_checkpointing', '--show_loss_step', '10', '--output_dir', './output-glm3'] exits with return code = 1

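The environment is broken: the binaries are incompatible. I'd suggest reinstalling the system.
_ZNSt15__exception_ptr13exception_ptr9_M_addrefEv is a C++ symbol from the libstdc++ runtime, so this is a C++ linking error raised when the compiled fused_adam.so is loaded, not a problem in the training script.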
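
To make the "C++-related" part concrete, here is a minimal sketch that decodes the mangled name from the ImportError by hand. The function `demangle_nested_name` is a hypothetical helper written for this thread, not part of DeepSpeed or PyTorch, and it only covers the small subset of the Itanium C++ ABI mangling that this one symbol uses (the `_ZN…E` wrapper, the `St` abbreviation for `std`, length-prefixed identifiers, and a `v` suffix meaning "no parameters"). For real work, the `c++filt` tool from binutils does this properly.

```python
import re


def demangle_nested_name(symbol: str) -> str:
    """Decode a simple Itanium-ABI nested name such as _ZN3foo3barEv.

    Hypothetical helper for illustration only. It supports:
      * the _ZN ... E wrapper around a nested name,
      * the `St` abbreviation for the `std` namespace,
      * length-prefixed identifiers (e.g. `9_M_addref`),
      * a trailing `v`, meaning the function takes no parameters.
    Anything else raises ValueError.
    """
    if not symbol.startswith("_ZN"):
        raise ValueError("not a mangled nested name")

    pos = 3
    parts = []

    # `St` is the ABI's shorthand for the `std` namespace.
    if symbol.startswith("St", pos):
        parts.append("std")
        pos += 2

    # Each component is <decimal length><identifier>, until the closing `E`.
    while pos < len(symbol) and symbol[pos] != "E":
        match = re.match(r"\d+", symbol[pos:])
        if match is None:
            raise ValueError(f"unsupported construct at offset {pos}")
        length = int(match.group())
        start = pos + match.end()
        parts.append(symbol[start:start + length])
        pos = start + length

    if pos >= len(symbol):
        raise ValueError("missing closing 'E'")

    # After `E` comes the parameter list; `v` means void, i.e. no parameters.
    if symbol[pos + 1:] != "v":
        raise ValueError("only functions without parameters are supported")

    return "::".join(parts) + "()"


if __name__ == "__main__":
    print(demangle_nested_name(
        "_ZNSt15__exception_ptr13exception_ptr9_M_addrefEv"
    ))
```

The decoded name lives in the `std` namespace, which is consistent with the diagnosis above: the fused_adam.so built under /root/.cache/torch_extensions expects a C++ standard library symbol that was not available when Python loaded it.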