thunlp / InfLLM

The code for our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"


Qwen1.5-7B-Chat CUDA error: out of memory

yinochaos opened this issue · comments

commented

Machine: A800 80 GB GPU, 360 GB system RAM
Config file:

model:
  type: inf-llm
  path: Qwen/Qwen1.5-7B-Chat
  block_size: 128
  n_init: 128
  n_local: 4096
  topk: 16
  repr_topk: 4
  max_cached_block: 32
  exc_block_size: 512
  score_decay: 0.1
  fattn: true
  base: 1000000
  distance_scale: 1.0

max_len: 2147483647
chunk_size: 512
conv_type: qwen
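
For context: with this config, each attention layer keeps roughly n_init + n_local + max_cached_block × block_size = 128 + 4096 + 32 × 128 ≈ 8.3K past tokens of KV on the GPU, offloads the rest to host memory in 128-token blocks, and retrieves the topk = 16 most relevant blocks per step. (This is a reading of the parameter names, not the paper's exact bookkeeping.)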

I modified the pred script to run inference; the input is roughly 280K tokens [lengths up to 190K do not trigger the error].
Error message:

Traceback (most recent call last):
  File "/root/data/user/XXXX/git/InfLLM/benchmark/common_pred.py", line 325, in <module>
    preds = get_pred(
  File "/root/data/user/XXXX/git/InfLLM/benchmark/common_pred.py", line 271, in get_pred
    output = searcher.generate(
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/utils/greedy_search.py", line 32, in generate
    result = self._decode(input_ids, **kwargs)
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/utils/greedy_search.py", line 54, in _decode
    out = self.model(
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1173, in forward
    outputs = self.model(
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/utils/patch.py", line 100, in model_forward
    layer_outputs = decoder_layer(
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 773, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/utils/patch.py", line 16, in hf_forward
    ret = forward(
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/attention/inf_llm.py", line 58, in forward
    o = past_key_value.append(
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/attention/context_manager.py", line 725, in append
    self.append_global(ed - st, kv_ed - kv_st, local_score)
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/attention/context_manager.py", line 620, in append_global
    MemoryUnit(self.global_remainder[0][u, :, global_remainder_st:global_remainder_st + self.block_size, :],
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/attention/context_manager.py", line 34, in __init__
    cpu_data = data.contiguous().to("cpu", non_blocking=True).pin_memory()
RuntimeError: CUDA error: out of memory

How can I solve this problem? I see GPU memory peaks at only 30+ GB, so where does the issue come from?
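
For scale, a rough estimate of the KV cache that InfLLM offloads to (pinned) host memory at this input length, assuming the usual Qwen1.5-7B shape (32 layers, 32 KV heads, head_dim 128, fp16) — a back-of-the-envelope sketch, not a measurement:

    n_layers, n_kv_heads, head_dim = 32, 32, 128   # assumed Qwen1.5-7B shape
    bytes_per_elem = 2                             # fp16
    tokens = 280_000                               # input length from this report
    # K and V, per token, summed over all layers
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    print(tokens * kv_bytes_per_token / 2**30)     # ~137 GiB of host memory, pinned by default

Pinned (page-locked) allocations can fail well before physical RAM runs out, and a failed cudaHostAlloc is reported as "CUDA error: out of memory" even when device memory is fine, which would be consistent with the ~30 GB of GPU usage observed here.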

Hi, this may be a pin memory issue. Try removing the pin memory call from MemoryUnit.
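
Concretely, that points at MemoryUnit.__init__ in inf_llm/attention/context_manager.py (line 34 in the traceback above). A minimal sketch of the edit, assuming the surrounding code matches the traceback:

    # inf_llm/attention/context_manager.py, MemoryUnit.__init__
    # before: cpu_data = data.contiguous().to("cpu", non_blocking=True).pin_memory()
    # after: skip the page-locked allocation; non_blocking is dropped too,
    # since a copy into pageable CPU memory cannot be overlapped anyway
    cpu_data = data.contiguous().to("cpu")

The trade-off is slower host/device transfers, since CUDA can only overlap copies with pinned buffers.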

commented

Hi, this may be a pin memory issue. Try removing the pin memory call from MemoryUnit.

Hi, after removing pin memory I still get the same error:

    cpu_data = data.contiguous().to("cpu", non_blocking=True)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
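
As the message says, the reported line may not be the real failure site, because CUDA errors are raised asynchronously. One way to localize it (a generic CUDA debugging step, not specific to InfLLM) is to force synchronous kernel launches before CUDA initializes:

    import os
    # must be set before torch initializes CUDA, so do it before importing torch
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    import torch

Equivalently, prefix the run with the environment variable: CUDA_LAUNCH_BLOCKING=1 python benchmark/common_pred.py ...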

Other relevant environment info:
Python 3.10.14
Driver Version: 470.161.03 CUDA Version: 12.1
torch: 2.2.2+cu121
transformers: 4.39.2

Sorry, we don't currently have the same test environment and cannot reproduce your issue.
You could try a build of torch for CUDA 11.8.
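
If it helps: a cu118 build of the same torch release can be installed from the official PyTorch wheel index, e.g. pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu118 (the version pin here simply mirrors the environment reported above).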

commented

Sorry, we don't currently have the same test environment and cannot reproduce your issue. You could try a build of torch for CUDA 11.8.

OK, thanks.

commented

@yinochaos Hi, may I ask how this issue was eventually resolved?