YaoJiayi / CacheBlend


CacheBlend (Under Construction):

This is the code repo for CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion. The current implementation is based on vLLM.

Installation

Python >= 3.9 and CUDA >= 12.1 are required. An NVIDIA GPU with >= 40 GB of memory is recommended. To install the CacheBlend dependencies:

git clone git@github.com:YaoJiayi/CacheBlend.git
cd CacheBlend/vllm_blend
pip install -e .
cd ..
pip install -r requirements.txt
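Before installing, the prerequisites above can be sanity-checked. This is a minimal sketch covering only the Python-side requirements; the GPU check assumes PyTorch (which vLLM depends on) is already present and is skipped otherwise:

```python
import sys

# CacheBlend requires Python >= 3.9 (see the requirements above).
assert sys.version_info >= (3, 9), "Python >= 3.9 is required"

# If PyTorch is already installed, also inspect the visible GPU.
try:
    import torch
    if torch.cuda.is_available():
        mem_gib = torch.cuda.get_device_properties(0).total_memory / 2**30
        print(f"GPU memory: {mem_gib:.1f} GiB (>= 40 GiB recommended)")
    else:
        print("No CUDA device visible")
except ImportError:
    print("PyTorch not installed yet; GPU check skipped")
```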

Example run

Run LLM inference with CacheBlend

python example/blend.py

Run the Musique dataset

Compare LLM inference with CacheBlend and normal prefill

python example/blend_musique.py

To run a dataset other than Musique, replace musique in the command above with samsum or wikimqa.
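The three dataset comparisons can be driven from one loop. A minimal sketch, assuming the scripts are invoked from the repository root and follow the blend_&lt;dataset&gt;.py naming shown above:

```shell
# Run the CacheBlend-vs-normal-prefill comparison on each supported dataset.
for ds in musique samsum wikimqa; do
  echo "would run: python example/blend_${ds}.py"
  # python "example/blend_${ds}.py"   # uncomment inside the CacheBlend repo
done
```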

References
