thunlp / InfLLM

The code for our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"


Qwen1.5-7B-Chat CUDA error: out of memory

yinochaos opened this issue · comments

commented

Machine: A800 80 GB GPU, 360 GB system RAM
Config file:

model:
  type: inf-llm
  path: Qwen/Qwen1.5-7B-Chat
  block_size: 128
  n_init: 128
  n_local: 4096
  topk: 16
  repr_topk: 4
  max_cached_block: 32
  exc_block_size: 512
  score_decay: 0.1
  fattn: true
  base: 1000000
  distance_scale: 1.0

max_len: 2147483647
chunk_size: 512
conv_type: qwen
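
For context: with this config, each attention layer keeps roughly n_init + n_local + max_cached_block × block_size = 128 + 4096 + 32 × 128 ≈ 8.3K past tokens of KV on the GPU, offloads the rest to host memory in 128-token blocks, and retrieves the topk = 16 most relevant blocks per step. (This is a reading of the parameter names, not the paper's exact bookkeeping.)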

I modified the pred script to run inference; the input is roughly 280K tokens [lengths up to 190K do not trigger the error].
Error message:

Traceback (most recent call last):
  File "/root/data/user/XXXX/git/InfLLM/benchmark/common_pred.py", line 325, in <module>
    preds = get_pred(
  File "/root/data/user/XXXX/git/InfLLM/benchmark/common_pred.py", line 271, in get_pred
    output = searcher.generate(
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/utils/greedy_search.py", line 32, in generate
    result = self._decode(input_ids, **kwargs)
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/utils/greedy_search.py", line 54, in _decode
    out = self.model(
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1173, in forward
    outputs = self.model(
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/utils/patch.py", line 100, in model_forward
    layer_outputs = decoder_layer(
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 773, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/data/shared/group/common_tools/mambaforge/envs/infllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/utils/patch.py", line 16, in hf_forward
    ret = forward(
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/attention/inf_llm.py", line 58, in forward
    o = past_key_value.append(
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/attention/context_manager.py", line 725, in append
    self.append_global(ed - st, kv_ed - kv_st, local_score)
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/attention/context_manager.py", line 620, in append_global
    MemoryUnit(self.global_remainder[0][u, :, global_remainder_st:global_remainder_st + self.block_size, :],
  File "/root/data/user/XXXX/git/InfLLM/inf_llm/attention/context_manager.py", line 34, in __init__
    cpu_data = data.contiguous().to("cpu", non_blocking=True).pin_memory()
RuntimeError: CUDA error: out of memory

How can I solve this problem? I see GPU memory peaks at only 30+ GB, so where does the issue come from?
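
For scale, a rough estimate of the KV cache that InfLLM offloads to (pinned) host memory at this input length, assuming the usual Qwen1.5-7B shape (32 layers, 32 KV heads, head_dim 128, fp16) — a back-of-the-envelope sketch, not a measurement:

    n_layers, n_kv_heads, head_dim = 32, 32, 128   # assumed Qwen1.5-7B shape
    bytes_per_elem = 2                             # fp16
    tokens = 280_000                               # input length from this report
    # K and V, per token, summed over all layers
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    print(tokens * kv_bytes_per_token / 2**30)     # ~137 GiB of host memory, pinned by default

Pinned (page-locked) allocations can fail well before physical RAM runs out, and a failed cudaHostAlloc is reported as "CUDA error: out of memory" even when device memory is fine, which would be consistent with the ~30 GB of GPU usage observed here.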

Hi, this may be a pin memory issue. Try removing the pin memory call from MemoryUnit.
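
Concretely, that points at MemoryUnit.__init__ in inf_llm/attention/context_manager.py (line 34 in the traceback above). A minimal sketch of the edit, assuming the surrounding code matches the traceback:

    # inf_llm/attention/context_manager.py, MemoryUnit.__init__
    # before: cpu_data = data.contiguous().to("cpu", non_blocking=True).pin_memory()
    # after: skip the page-locked allocation; non_blocking is dropped too,
    # since a copy into pageable CPU memory cannot be overlapped anyway
    cpu_data = data.contiguous().to("cpu")

The trade-off is slower host/device transfers, since CUDA can only overlap copies with pinned buffers.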

commented

Hi, this may be a pin memory issue. Try removing the pin memory call from MemoryUnit.

Hi, after removing pin memory I still get the same error:

    cpu_data = data.contiguous().to("cpu", non_blocking=True)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
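
As the message says, the reported line may not be the real failure site, because CUDA errors are raised asynchronously. One way to localize it (a generic CUDA debugging step, not specific to InfLLM) is to force synchronous kernel launches before CUDA initializes:

    import os
    # must be set before torch initializes CUDA, so do it before importing torch
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    import torch

Equivalently, prefix the run with the environment variable: CUDA_LAUNCH_BLOCKING=1 python benchmark/common_pred.py ...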

Other relevant environment info:
Python 3.10.14
Driver Version: 470.161.03 CUDA Version: 12.1
torch: 2.2.2+cu121
transformers: 4.39.2

Sorry, we don't currently have the same test environment and cannot reproduce your issue.
You could try a build of torch for CUDA 11.8.
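
If it helps: a cu118 build of the same torch release can be installed from the official PyTorch wheel index, e.g. pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu118 (the version pin here simply mirrors the environment reported above).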

commented

Sorry, we don't currently have the same test environment and cannot reproduce your issue. You could try a build of torch for CUDA 11.8.

OK, thanks.

commented

@yinochaos Hi, may I ask how this issue was eventually resolved?