James-QiuHaoran / LLM-serving-with-proxy-models

Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction | A tiny BERT model can tell you the verbosity of an LLM (with low latency overhead!)
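The core idea — use a cheap proxy model to predict how many tokens the LLM will generate, then schedule requests with short predicted outputs first — can be sketched as below. This is a minimal illustration, not the repository's implementation: `predicted_length` is a hypothetical stand-in for the tiny BERT regressor, replaced here by a trivial heuristic so the sketch stays self-contained.

```python
import heapq

def predicted_length(prompt: str) -> int:
    # Hypothetical stand-in for the proxy model. In the real system a tiny
    # fine-tuned BERT would map the prompt to an expected output token count;
    # here a word-count heuristic keeps the sketch dependency-free.
    return 32 + 4 * len(prompt.split())

def schedule(prompts):
    # Shortest-predicted-job-first: pop pending requests in order of the
    # proxy model's predicted response length to reduce average latency.
    heap = [(predicted_length(p), i, p) for i, p in enumerate(prompts)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, prompt = heapq.heappop(heap)
        order.append(prompt)
    return order

requests = ["hi", "explain transformers in detail please", "what is 2+2"]
print(schedule(requests))
```

Because the proxy is far smaller than the served LLM, its prediction adds little latency relative to the generation it helps to order.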

Repository from GitHub: https://github.com/James-QiuHaoran/LLM-serving-with-proxy-models
