luciferlinx101 / punica

Serving multiple LoRA finetuned LLM as one

Home Page:https://arxiv.org/abs/2310.18547

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Punica: Serving multiple LoRA finetuned LLM as one

(paper)

Demo

punica-tui-demo-vp9.webm
python examples/tui-multi-lora.py

Overview

Low rank adapation (LoRA) is a parameter efficient way to add new knowledge to a pretrained LLM. Although the pretrained LLM takes 100s of GB storage, a LoRA finetuned model only adds 1% storage and memory overhead. Punica enables running multiple LoRA finetuned models at the cost of running one.

How?

Assuming W of shape [H1, H2] is the weight of the pretrained model, LoRA adds two small matrices A of shape [H1, r] and B of [r, H2]. Running a input x on the finetuned model would be y := x @ (W + A@B), which is the same as y := x@W + x@A@B.

When there are n LoRA models, there will be A1, B1, A2, B2, ..., An, Bn. Given a input batch X := (x1,x2,...,xn) that maps to each LoRA model, the output is Y := X@W + (x1@A1@B1, x2@A2@B2, ..., xn@An@Bn). The left-hand-side computes the input batch on the pretrained model. It is quite efficient. The latency is almost the same as when there's only one input, thanks to the strong batching effect.

We figured out an efficient way to compute the right-hand-side (the LoRA addon). We encapsulate this operation in a CUDA kernel, called Segmented Gather Matrix-Vector multiplication (SGMV), as illustrated below.

SGMV

In the following microbenchmark figure, we can observe the strong batching effect of the pretrained model. Naive implementation of LoRA is slow, as depicted in the orange line. LoRA implemented via SGMV is efficient and preserves the strong batching effect.

SGMV is fast and maintains strong batching effect

The following figure shows the text generation throughput comparison between Punica and other systems, including HuggingFace Transformers, DeepSpeed, FasterTransformer, vLLM. The benchmark considers different settings of LoRA model popularity. Distinct means that each request is for a different LoRA model. Identical means that all requests are for the same LoRA model. Uniform and Skewed are in between. Punica achieves 12x throughput compared to state-of-the-art systems.

Punica achieves 12x throughput compared to state-of-the-art systems

Read our paper to understand more: Punica: Multi-Tenant LoRA Serving.

Install

git clone https://github.com/punica-ai/punica.git
cd punica
git submodule sync
git submodule update --init --recursive

pip install ninja torch
pip install -v --no-build-isolation .

Examples

Serving multiple LoRA models

See the demo above.

Finetune & convert to Punica format & serve with Punica

See examples/finetune/

Benchmark text generation

python -m benchmarks.bench_textgen_lora --system punica --batch-size 32

Citation

@misc{punica,
    title={Punica: Multi-Tenant LoRA Serving},
    author={Lequn Chen and Zihao Ye and Yongji Wu and Danyang Zhuo and Luis Ceze and Arvind Krishnamurthy},
    year={2023},
    eprint={2310.18547},
    archivePrefix={arXiv},
    primaryClass={cs.DC}
}

About

Serving multiple LoRA finetuned LLM as one

https://arxiv.org/abs/2310.18547

License:Apache License 2.0


Languages

Language:Cuda 63.9%Language:Python 28.3%Language:C++ 7.4%Language:CMake 0.5%