thunlp / InfLLM

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"


Qwen1.5-72B-chat-AWQ OOMs on A100 80 GB with the LongBench and InfiniteBench benchmarks

ehuaa opened this issue · comments


When I test Qwen1.5-72B-chat-AWQ with
bash scripts/longbench.sh, it runs out of memory (OOM) on an A100 80 GB.

My config:
model:
type: inf-llm
path: /root/czh/quant_models/Qwen2-geogpt-72b-0412-awq-dde-12000
block_size: 128
n_init: 128
n_local: 4096
topk: 16
repr_topk: 4
max_cached_block: 32
exc_block_size: 512
fattn: false
base: 1000000
distance_scale: 1.0

max_len: 2147483647
chunk_size: 2048
conv_type: qwen
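For context, a rough back-of-envelope estimate of how much KV cache this config keeps resident on the GPU. The model dimensions below (80 layers, 64 KV heads of dim 128, fp16) are assumptions chosen to illustrate a 72B-class model, not values read from the checkpoint, and treating max_cached_block as a per-layer, per-head token budget may overcount how InfLLM actually caches blocks:

```python
# Back-of-envelope KV-cache estimate for the InfLLM config above.
# Assumed (not from the checkpoint): 80 layers, 64 KV heads, head_dim 128, fp16.
def kv_cache_gib(n_init=128, n_local=4096, block_size=128, max_cached_block=32,
                 exc_block_size=512, n_layers=80, n_kv_heads=64, head_dim=128,
                 dtype_bytes=2):
    # Tokens kept resident: initial tokens + local window
    # + cached memory blocks + one encoding chunk.
    resident_tokens = n_init + n_local + max_cached_block * block_size + exc_block_size
    # Factor 2 for K and V, per layer, per KV head.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * resident_tokens
    return total_bytes / 2**30

print(f"{kv_cache_gib():.1f} GiB resident KV (before weights and activations)")
```

Under these assumptions the KV cache alone is on the order of 20 GiB, on top of the quantized 72B weights, which leaves little headroom on a single 80 GB card.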

The Traceback is as follows:
Traceback (most recent call last):
File "/root/czh/InfLLM/benchmark/pred.py", line 321, in
preds = get_pred(
File "/root/czh/InfLLM/benchmark/pred.py", line 256, in get_pred
output = searcher.generate(
File "/root/czh/InfLLM/inf_llm/utils/greedy_search.py", line 32, in generate
result = self._decode(input_ids, **kwargs)
File "/root/czh/InfLLM/inf_llm/utils/greedy_search.py", line 54, in _decode
out = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 1169, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/czh/InfLLM/inf_llm/utils/patch.py", line 100, in model_forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 768, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/czh/InfLLM/inf_llm/utils/patch.py", line 16, in hf_forward
ret = forward(
File "/root/czh/InfLLM/inf_llm/attention/inf_llm.py", line 64, in forward
o = past_key_value.append(
File "/root/czh/InfLLM/inf_llm/attention/context_manager.py", line 774, in append
chunk_o, local_score = self._append(
File "/root/czh/InfLLM/inf_llm/attention/context_manager.py", line 526, in _append
attn.append(
File "/root/czh/InfLLM/inf_llm/attention/dot_production_attention/torch_impl.py", line 96, in append
self.finalize()
File "/root/czh/InfLLM/inf_llm/attention/dot_production_attention/torch_impl.py", line 22, in finalize
tmp = torch.masked_fill(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 79.15 GiB of which 190.19 MiB is free. Process 3985934 has 78.95 GiB memory in use. Of the allocated memory 75.61 GiB is allocated by PyTorch, and 2.82 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
/usr/local/lib/python3.10/dist-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
Evaluating on: ['result.json']
{}
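One thing I have not tried yet: the allocator message in the traceback itself suggests enabling expandable segments to reduce fragmentation (2.82 GiB is reserved but unallocated). A minimal sketch of re-running with that setting, assuming the same benchmark script:

```shell
# Suggested by the PyTorch OOM message above: let the caching allocator
# use expandable segments to reduce fragmentation, then re-run, e.g.:
#   bash scripts/longbench.sh
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```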
Can someone help with this issue? Thanks!