Followed the tutorial step by step; PPO training gets stuck at CUDA error: device-side assert triggered

Question

I followed the tutorial step by step. When I reached PPO training, it got stuck at CUDA error: device-side assert triggered

karl-tao-zhang opened this issue 10 months ago · comments

Using pad_token, but it is not set yet.
Loading base model for ppo training...
Loading base
Loading lora
Loading ppo
WARNING:root:A <class 'peft.peft_model.PeftModelForCausalLM'> model is loaded from '/root/autodl-tmp/LLM/weights/sft_lora', and no v_head weight is found. This IS expected if you are not resuming PPO training.
Loading base model for reward model...
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Some weights of BaichuanForSequenceClassification were not initialized from the model checkpoint at baichuan-inc/baichuan-7B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Starting training
0it [00:00, ?it/s]---------------------
CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

0
0it [00:10, ?it/s]
Traceback (most recent call last):
File "rl_training.py", line 331, in <module>
response_tensors = ppo_trainer.generate(
File "/root/miniconda3/lib/python3.8/site-packages/trl/trainer/ppo_trainer.py", line 446, in generate
return self._generate_batched(
File "/root/miniconda3/lib/python3.8/site-packages/trl/trainer/ppo_trainer.py", line 503, in _generate_batched
generations = self.accelerator.unwrap_model(self.model).generate(**padded_inputs, **generation_kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/trl/models/modeling_value_head.py", line 198, in generate
return self.pretrained_model.generate(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/peft/peft_model.py", line 975, in generate
outputs = self.base_model.generate(**kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/transformers/generation/utils.py", line 1648, in generate
return self.sample(
File "/root/miniconda3/lib/python3.8/site-packages/transformers/generation/utils.py", line 2730, in sample
outputs = self(
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 166, in new_forward
return module._hf_hook.post_forward(module, output)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/hooks.py", line 305, in post_forward
output = send_to_device(output, self.input_device, skip_keys=self.skip_keys)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 160, in send_to_device
{
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 161, in <dictcomp>
k: t if k in skip_keys else send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 151, in send_to_device
return honor_type(
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 83, in honor_type
return type(obj)(generator)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 152, in <genexpr>
tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 151, in send_to_device
return honor_type(
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 83, in honor_type
return type(obj)(generator)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 152, in <genexpr>
tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
File "/root/miniconda3/lib/python3.8/site-packages/accelerate/utils/operations.py", line 167, in send_to_device
return tensor.to(device, non_blocking=non_blocking)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "rl_training.py", line 364, in <module>
print(question_tensors)
File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 426, in __repr__
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor_str.py", line 636, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor_str.py", line 567, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor_str.py", line 327, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor_str.py", line 111, in __init__
value_str = "{}".format(value)
File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 872, in __format__
return self.item().__format__(format_spec)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
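
The log above recommends `CUDA_LAUNCH_BLOCKING=1` because the reported stack trace may not point at the kernel that actually failed. A minimal sketch of how that could be applied from Python follows, assuming the variable is set before torch initializes CUDA. It also includes a hypothetical helper, `find_out_of_range_ids`, which checks token ids against the vocabulary size on the CPU. Out-of-range indices in an embedding lookup are a frequent cause of this assert in general, but the thread does not confirm that this is the cause here.

```python
import os

# Must be set before torch initializes CUDA, so do it before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def find_out_of_range_ids(batch_ids, vocab_size):
    """Return (row, col, token_id) for every id outside [0, vocab_size).

    batch_ids is a list of lists of ints, e.g. each tensor converted
    with .tolist() so the check runs on the CPU.
    """
    bad = []
    for row, ids in enumerate(batch_ids):
        for col, token_id in enumerate(ids):
            if not 0 <= token_id < vocab_size:
                bad.append((row, col, token_id))
    return bad
```

Running such a check on the query token ids before `ppo_trainer.generate` would show whether any id exceeds the size of the model's embedding table; an empty result rules this particular cause out.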

karl-tao-zhang · Answer 1 · Sat Sep 09 2023 19:16:38 GMT+0800 (China Standard Time)

CUDA_VISIBLE_DEVICES=0,1,2,3 python rl_training.py \
    --base_model_name baichuan-inc/baichuan-7B \
    --merged_sft_model_path /root/autodl-tmp/LLM/weights/sft_lora \
    --sft_model_lora_path /root/autodl-tmp/LLM/weights/sft_lora \
    --reward_model_lora_path /root/autodl-tmp/LLM/weights/rm_lora \
    --adafactor False \
    --save_freq 10 \
    --output_max_length 256 \
    --batch_size 2 \
    --gradient_accumulation_steps 2 \
    --batched_gen True \
    --ppo_epochs 4 \
    --seed 0 \
    --learning_rate 1e-5 \
    --early_stopping True \
    --output_dir /root/autodl-tmp/LLM/weights/ppo_lora

karl-tao-zhang · Answer 2 · Sat Sep 09 2023 19:27:12 GMT+0800 (China Standard Time)

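Four RTX 3090s did not have enough GPU memory, so I switched to four A40s, and that is where the error above appeared.
After hitting the error, I searched the trl issues for related code. Is this the suggested way to fix it?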
tokenizer.eos_token_id = model.config.eos_token_id
tokenizer.pad_token = tokenizer.eos_token
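
The two assignments above can be wrapped in a small helper so that the same ids are also handed to generation. This is only a sketch: `align_pad_token` is a hypothetical name, the stand-in objects and their values are made up for illustration, and the thread does not confirm that this change resolves the error.

```python
from types import SimpleNamespace


def align_pad_token(tokenizer, model_config):
    """Apply the two assignments from the trl issue and return generation ids.

    Works on any objects exposing these attributes. With a real Hugging Face
    tokenizer, assigning pad_token also updates pad_token_id.
    """
    tokenizer.eos_token_id = model_config.eos_token_id
    tokenizer.pad_token = tokenizer.eos_token
    return {
        "pad_token_id": tokenizer.eos_token_id,
        "eos_token_id": tokenizer.eos_token_id,
    }


# Stand-in objects for illustration only (values are made up).
demo_tokenizer = SimpleNamespace(eos_token="</s>", eos_token_id=None, pad_token=None)
demo_config = SimpleNamespace(eos_token_id=2)
demo_ids = align_pad_token(demo_tokenizer, demo_config)
```

The returned dictionary could be merged into the generation keyword arguments so that padding during batched generation uses an id the model actually knows.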

karl-tao-zhang · Answer 3 · Sun Sep 10 2023 07:06:34 GMT+0800 (China Standard Time)

It only works with a single GPU.
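
The thread does not say how the single GPU was selected. One plausible way, shown here as a sketch under that assumption, is to expose only one device through `CUDA_VISIBLE_DEVICES` instead of the `0,1,2,3` used in the original command. The helper name `single_gpu_env` is hypothetical.

```python
import os


def single_gpu_env(base_env, gpu_index=0):
    """Return a copy of base_env that exposes exactly one GPU to CUDA."""
    env = dict(base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


env = single_gpu_env(os.environ)
# Pass `env` to subprocess.run([...], env=env) when launching rl_training.py,
# or export the same variable in the shell before running the script.
```

Copying the environment rather than mutating `os.environ` keeps the change local to the launched training process.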