hamelsmu / llama-inference

experiments with inference on llama

llama-inference

An exploration of latency across various inference setups for Llama.

Caveats

  • I didn't explore throughput. That is a deep rabbit hole; here I was only measuring latency for a single request (see the sketch after this list). You can trade off throughput against latency with various forms of request batching.
  • I did my best to use each tool as its documentation recommends.
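To make the measurement concrete, here is a minimal sketch of timing one unbatched request end to end. The endpoint URL, model name, and payload schema are hypothetical placeholders (an OpenAI-style HTTP completions server is assumed); adapt them to whichever serving tool you are benchmarking.

```python
# A minimal sketch of single-request latency measurement. The endpoint,
# model name, and payload fields below are hypothetical placeholders,
# not the exact setup used in this repo.
import time

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # hypothetical local server
PAYLOAD = {
    "model": "llama-2-7b",           # placeholder model name
    "prompt": "Hello, my name is",
    "max_tokens": 64,
}


def time_single_request() -> float:
    """Return wall-clock latency in seconds for one unbatched request."""
    start = time.perf_counter()
    response = requests.post(ENDPOINT, json=PAYLOAD, timeout=120)
    response.raise_for_status()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Report each run separately so warm-up effects stay visible
    # instead of being averaged away.
    for i in range(3):
        print(f"run {i}: {time_single_request():.2f}s")
```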

Languages

Python 51.3%, Jupyter Notebook 47.2%, Shell 1.5%