thunlp / InfLLM

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"

OOM issue

microhu opened this issue · comments

Great work!
When I try to run a 13B model with more than 100K context length on the passkey retrieval task, it throws an OOM error.
Does the code itself support multi-GPU inference?

Thank you for your support.

We currently have no plans to provide a multi-GPU inference script.
The goal of this repository is to offer scripts for reproducing the results of our paper
and to provide an efficient implementation of streaming attention.

The simple patch code we are currently using can be found in inf_llm/utils/patch.py.
Attention for different layers can run on different devices, so you can modify the patch code to achieve simple block-level model parallelism (see the sketch below).
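As a rough illustration of that idea, here is a minimal sketch of block-level model parallelism for a HuggingFace-style decoder. It assumes the transformer blocks live in `model.model.layers` and that the model also exposes `embed_tokens`, `norm`, and `lm_head` (Llama-style attribute names); these names are assumptions and are not part of the InfLLM patch code itself.

```python
# Minimal sketch: place contiguous groups of decoder blocks on successive GPUs
# and move hidden states between devices at block boundaries.
# Assumes a Llama-style HuggingFace model layout (model.model.layers, etc.).
import torch


def split_layers_across_gpus(model, num_gpus: int):
    """Assign contiguous groups of transformer blocks to successive GPUs."""
    layers = model.model.layers  # assumed attribute holding the decoder blocks
    per_gpu = (len(layers) + num_gpus - 1) // num_gpus

    for idx, layer in enumerate(layers):
        device = torch.device(f"cuda:{min(idx // per_gpu, num_gpus - 1)}")
        layer.to(device)

        # Wrap each block's forward so its inputs follow the block's device.
        orig_forward = layer.forward

        def forward(*args, _orig=orig_forward, _dev=device, **kwargs):
            args = tuple(a.to(_dev) if torch.is_tensor(a) else a for a in args)
            kwargs = {k: (v.to(_dev) if torch.is_tensor(v) else v)
                      for k, v in kwargs.items()}
            return _orig(*args, **kwargs)

        layer.forward = forward

    # Keep the embeddings on the first GPU and the final norm / LM head on the last.
    model.model.embed_tokens.to("cuda:0")
    model.model.norm.to(f"cuda:{num_gpus - 1}")
    model.lm_head.to(f"cuda:{num_gpus - 1}")
    return model
```

This is only one way to spread memory across devices; the actual patch code in inf_llm/utils/patch.py may require placing the wrapping at a different point, but the overall structure (moving blocks to devices and transferring hidden states between them) is the same.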

We encourage users to utilize the implementation in inf_llm/attention within other frameworks.