thunlp / InfLLM

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"

OOM issue

microhu opened this issue · comments

Great work!
When I try to run a 13B model with more than 100K context length on the passkey retrieval task, it throws an OOM error.
Does the code itself support multi-GPU inference?

Thank you for your support.

We currently have no plans to provide a multi-GPU inference script.
The goal of this repository is to offer scripts for reproducing the results of our paper
and to provide an efficient implementation of streaming attention.

The simple patch code we are currently using can be found in inf_llm/utils/patch.py.
Attention for different layers can run on different devices, so you can modify the patch code to achieve simple block-level model parallelism (see the sketch below).
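As a rough illustration of that idea, here is a minimal sketch of block-level model parallelism for a HuggingFace-style decoder. It assumes the transformer blocks live in `model.model.layers` and that the model also exposes `embed_tokens`, `norm`, and `lm_head` (Llama-style attribute names); these names are assumptions and are not part of the InfLLM patch code itself.

```python
# Minimal sketch: place contiguous groups of decoder blocks on successive GPUs
# and move hidden states between devices at block boundaries.
# Assumes a Llama-style HuggingFace model layout (model.model.layers, etc.).
import torch


def split_layers_across_gpus(model, num_gpus: int):
    """Assign contiguous groups of transformer blocks to successive GPUs."""
    layers = model.model.layers  # assumed attribute holding the decoder blocks
    per_gpu = (len(layers) + num_gpus - 1) // num_gpus

    for idx, layer in enumerate(layers):
        device = torch.device(f"cuda:{min(idx // per_gpu, num_gpus - 1)}")
        layer.to(device)

        # Wrap each block's forward so its inputs follow the block's device.
        orig_forward = layer.forward

        def forward(*args, _orig=orig_forward, _dev=device, **kwargs):
            args = tuple(a.to(_dev) if torch.is_tensor(a) else a for a in args)
            kwargs = {k: (v.to(_dev) if torch.is_tensor(v) else v)
                      for k, v in kwargs.items()}
            return _orig(*args, **kwargs)

        layer.forward = forward

    # Keep the embeddings on the first GPU and the final norm / LM head on the last.
    model.model.embed_tokens.to("cuda:0")
    model.model.norm.to(f"cuda:{num_gpus - 1}")
    model.lm_head.to(f"cuda:{num_gpus - 1}")
    return model
```

This is only one way to spread memory across devices; the actual patch code in inf_llm/utils/patch.py may require placing the wrapping at a different point, but the overall structure (moving blocks to devices and transferring hidden states between them) is the same.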

We encourage users to utilize the implementation in inf_llm/attention within other frameworks.