

CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming

This is the code repo for CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming.

Installation

  1. Install the required Python packages to run CacheGen with conda:
conda env create -f env.yml 
  2. Build the GPU version of the Arithmetic Coding (AC) decoder:
cd src/decoder
python setup.py install

The GPU version of the AC encoder is coming soon!
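CacheGen shrinks the KV cache by quantizing the tensor values and then entropy-coding them with arithmetic coding, which is what the AC decoder above reverses. A minimal pure-Python sketch of a uniform quantization stage (a simplified illustration only; the function names are hypothetical and the repo's real encoder quantizes with finer per-layer control):

```python
def quantize(xs, bits=4):
    # Uniform symmetric quantization: map floats to integers in
    # [-(2**(bits-1) - 1), 2**(bits-1) - 1]. This is only the basic
    # idea behind the stage that precedes arithmetic coding.
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(x) for x in xs)
    scale = peak / qmax if peak > 0 else 1.0
    return [round(x / scale) for x in xs], scale

def dequantize(qs, scale):
    # Recover approximate floats from the quantized integers.
    return [q * scale for q in qs]

qs, scale = quantize([0.9, -0.3, 0.6])
print(qs)  # [7, -2, 5]
```

The small integer alphabet produced here is what makes an entropy coder such as AC effective downstream.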

Example run

To generate the KV cache for a given text file, run:

LAYERS=32 CHANNELS=4096 python main.py --generate_kv --path 9k_prompts/1.txt --save_dir <PATH TO YOUR HOME DIRECTORY>
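Conceptually, this step runs the prompt through the model once and persists the per-layer key/value tensors. A toy sketch of that save/load round trip, with plain Python lists standing in for tensors (the file name and function names are hypothetical, not the repo's actual format):

```python
import os
import pickle
import tempfile

def save_kv_cache(kv, path):
    # kv holds one (key, value) pair per layer; in the real pipeline
    # these would be tensors of shape (batch, heads, seq_len, head_dim).
    with open(path, "wb") as f:
        pickle.dump(kv, f)

def load_kv_cache(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy cache: 2 layers, each with a (key, value) pair.
kv = [([0.1, 0.2], [0.3, 0.4]), ([0.5, 0.6], [0.7, 0.8])]
path = os.path.join(tempfile.mkdtemp(), "kv.pkl")
save_kv_cache(kv, path)
assert load_kv_cache(path) == kv
```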

To run encoding and decoding for a LongChat-7b model:

mkdir data

THREADS=128 BLOCKS=32 LAYERS=32 CHANNELS=4096 python main.py --path 9k_prompts/1.txt --save_dir <PATH TO YOUR HOME DIRECTORY>

where LAYERS is the number of transformer layers in the LLM and CHANNELS is its number of channels (the hidden dimension); 32 and 4096 match LongChat-7b.
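These two parameters also let you estimate the raw KV cache size that CacheGen has to stream. A back-of-the-envelope sketch (assumes fp16 storage and one key plus one value element per layer per channel per token; a simplified model, not the repo's code):

```python
import os

def kv_cache_bytes(layers, channels, seq_len, bytes_per_elem=2):
    # Keys and values each hold layers * seq_len * channels elements;
    # the leading factor of 2 counts both K and V.
    return 2 * layers * seq_len * channels * bytes_per_elem

# Read the same environment variables the commands above use.
layers = int(os.environ.get("LAYERS", 32))
channels = int(os.environ.get("CHANNELS", 4096))

# A ~9k-token context (as in 9k_prompts/) at fp16 is several GB,
# which is why compressing the cache before loading it pays off.
print(kv_cache_bytes(layers, channels, seq_len=9000) / 1e9)
```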

References

@misc{liu2024cachegen,
      title={CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming}, 
      author={Yuhan Liu and Hanchen Li and Yihua Cheng and Siddhant Ray and Yuyang Huang and Qizheng Zhang and Kuntai Du and Jiayi Yao and Shan Lu and Ganesh Ananthanarayanan and Michael Maire and Henry Hoffmann and Ari Holtzman and Junchen Jiang},
      year={2024},
      eprint={2310.07240},
      archivePrefix={arXiv},
      primaryClass={cs.NI}
}
