

CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming

This is the code repo for CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming.

Installation

  1. Install the required Python packages to run CacheGen with conda:
conda env create -f env.yml 
  2. Build the GPU version of the Arithmetic Coding (AC) decoder:
cd src/decoder
python setup.py install

The GPU version of the AC encoder is coming soon!
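CacheGen shrinks the KV cache by quantizing the tensor values and then entropy-coding them with arithmetic coding, which is what the AC decoder above reverses. A minimal pure-Python sketch of a uniform quantization stage (a simplified illustration only; the function names are hypothetical and the repo's real encoder quantizes with finer per-layer control):

```python
def quantize(xs, bits=4):
    # Uniform symmetric quantization: map floats to integers in
    # [-(2**(bits-1) - 1), 2**(bits-1) - 1]. This is only the basic
    # idea behind the stage that precedes arithmetic coding.
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(x) for x in xs)
    scale = peak / qmax if peak > 0 else 1.0
    return [round(x / scale) for x in xs], scale

def dequantize(qs, scale):
    # Recover approximate floats from the quantized integers.
    return [q * scale for q in qs]

qs, scale = quantize([0.9, -0.3, 0.6])
print(qs)  # [7, -2, 5]
```

The small integer alphabet produced here is what makes an entropy coder such as AC effective downstream.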

Example run

To generate the KV cache for a given text file, run:

LAYERS=32 CHANNELS=4096 python main.py --generate_kv --path 9k_prompts/1.txt --save_dir <PATH TO YOUR HOME DIRECTORY>
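Conceptually, this step runs the prompt through the model once and persists the per-layer key/value tensors. A toy sketch of that save/load round trip, with plain Python lists standing in for tensors (the file name and function names are hypothetical, not the repo's actual format):

```python
import os
import pickle
import tempfile

def save_kv_cache(kv, path):
    # kv holds one (key, value) pair per layer; in the real pipeline
    # these would be tensors of shape (batch, heads, seq_len, head_dim).
    with open(path, "wb") as f:
        pickle.dump(kv, f)

def load_kv_cache(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy cache: 2 layers, each with a (key, value) pair.
kv = [([0.1, 0.2], [0.3, 0.4]), ([0.5, 0.6], [0.7, 0.8])]
path = os.path.join(tempfile.mkdtemp(), "kv.pkl")
save_kv_cache(kv, path)
assert load_kv_cache(path) == kv
```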

To run encoding and decoding for a LongChat-7b model:

mkdir data

THREADS=128 BLOCKS=32 LAYERS=32 CHANNELS=4096 python main.py --path 9k_prompts/1.txt --save_dir <PATH TO YOUR HOME DIRECTORY>

where LAYERS is the number of transformer layers in the LLM and CHANNELS is its number of channels (the hidden dimension); 32 and 4096 match LongChat-7b.
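These two parameters also let you estimate the raw KV cache size that CacheGen has to stream. A back-of-the-envelope sketch (assumes fp16 storage and one key plus one value element per layer per channel per token; a simplified model, not the repo's code):

```python
import os

def kv_cache_bytes(layers, channels, seq_len, bytes_per_elem=2):
    # Keys and values each hold layers * seq_len * channels elements;
    # the leading factor of 2 counts both K and V.
    return 2 * layers * seq_len * channels * bytes_per_elem

# Read the same environment variables the commands above use.
layers = int(os.environ.get("LAYERS", 32))
channels = int(os.environ.get("CHANNELS", 4096))

# A ~9k-token context (as in 9k_prompts/) at fp16 is several GB,
# which is why compressing the cache before loading it pays off.
print(kv_cache_bytes(layers, channels, seq_len=9000) / 1e9)
```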

References

@misc{liu2024cachegen,
      title={CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming}, 
      author={Yuhan Liu and Hanchen Li and Yihua Cheng and Siddhant Ray and Yuyang Huang and Qizheng Zhang and Kuntai Du and Jiayi Yao and Shan Lu and Ganesh Ananthanarayanan and Michael Maire and Henry Hoffmann and Ari Holtzman and Junchen Jiang},
      year={2024},
      eprint={2310.07240},
      archivePrefix={arXiv},
      primaryClass={cs.NI}
}
