thunlp / InfLLM

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"

GPU memory usage at benchmark

Minami-su opened this issue · comments

I'd like to know the expected GPU memory usage, because I run out of memory when testing the benchmarks.
model: Qwen1.5-0.5B-Chat
GPU: RTX 3090
commands:
bash scripts/infintebench.sh
bash scripts/longbench.sh
result: CUDA out of memory.

Hi! Could you provide more information, such as your configuration and which dataset you were evaluating when the out of memory issue occurred?

config.json

model:
  type: inf-llm
  path: Qwen1.5-0.5B-Chat
  block_size: 128
  n_init: 128
  n_local: 4096
  topk: 16
  repr_topk: 4
  max_cached_block: 32
  exc_block_size: 512
  score_decay: 0.1
  fattn: true
  base: 1000000
  distance_scale: 1.0

max_len: 2147483647
chunk_size: 8192
conv_type: qwen
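As an aside (not from the thread, and untested): if memory stays tight, the settings above that most directly control the GPU-resident KV cache appear to be `n_local` and `max_cached_block`. A plausible lower-memory variant of the same config would shrink both, for example:

```yaml
# Hypothetical lower-memory tweak (assumption, not a verified recommendation):
# halve the local attention window and the number of GPU-cached memory blocks.
model:
  type: inf-llm
  path: Qwen1.5-0.5B-Chat
  block_size: 128
  n_init: 128
  n_local: 2048          # was 4096
  topk: 16
  repr_topk: 4
  max_cached_block: 16   # was 32
  exc_block_size: 512
  score_decay: 0.1
  fattn: true
  base: 1000000
  distance_scale: 1.0
```

Whether these values preserve benchmark scores would need to be checked separately.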

bash scripts/longbench.sh

mkdir: cannot create directory ‘benchmark/longbench-result’: File exists
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Pred narrativeqa
  0%|                                                                                           | 0/200 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
  0%|                                                                                           | 0/200 [00:05<?, ?it/s]
Traceback (most recent call last):
  File "/home/luhao/InfLLM-main2/benchmark/pred.py", line 299, in <module>
    preds = get_pred(
  File "/home/luhao/InfLLM-main2/benchmark/pred.py", line 241, in get_pred
    output = searcher.generate(
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/utils/greedy_search.py", line 33, in generate
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/utils/greedy_search.py", line 55, in _decode
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1173, in forward
    outputs = self.model(
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/utils/patch.py", line 98, in model_forward
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 773, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/utils/patch.py", line 16, in hf_forward
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/attention/inf_llm.py", line 60, in forward
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/attention/context_manager.py", line 726, in append
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/attention/context_manager.py", line 616, in append_global
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/inf_llm-0.0.1-py3.9.egg/inf_llm/attention/context_manager.py", line 20, in __init__
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

/root/anaconda3/envs/train/lib/python3.9/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
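Unrelated to the crash, the repeated huggingface/tokenizers fork warning in the log above can be silenced as the message itself suggests, by disabling tokenizer parallelism before launching the run:

```shell
# Silence the huggingface/tokenizers fork warning by disabling parallelism
# in the environment before the benchmark (e.g. before: bash scripts/longbench.sh)
export TOKENIZERS_PARALLELISM=false
```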

I tested narrativeqa on an A40 (48G) with your settings and limited CUDA memory usage to 24G; no out-of-memory error occurred. Did you use the model Qwen/Qwen1.5-0.5B-Chat from the Hugging Face Hub?
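For reproducing that kind of cap, one way (my assumption about how the limit was imposed, not confirmed by the thread) is PyTorch's per-process memory fraction, which restricts the caching allocator rather than the physical card:

```python
# Sketch: cap this process's CUDA allocations at 24 GiB, mirroring the
# 24G limit mentioned above. Safe to run on machines without torch or a GPU.
try:
    import torch
    has_cuda = torch.cuda.is_available()
except ImportError:
    torch, has_cuda = None, False

CAP_BYTES = 24 * 1024**3  # 24 GiB budget

if has_cuda:
    total = torch.cuda.get_device_properties(0).total_memory
    # fraction is relative to the device's total memory (clamped to 1.0)
    torch.cuda.set_per_process_memory_fraction(min(1.0, CAP_BYTES / total), device=0)
```

On a 48 GiB A40 this resolves to a fraction of 0.5; allocations beyond the cap then raise an OOM inside the process instead of exhausting the device.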