
MLC LLM Performance Benchmarking

Performance Numbers

| Model | GPU | MLC LLM (tok/sec) | Exllama (tok/sec) |
|------------|-------------|-------------------|-------------------|
| Llama2-7B | RTX 3090 Ti | 154.1 | 116.38 |
| Llama2-13B | RTX 3090 Ti | 93.1 | 70.45 |

Commit:

Step-by-step Guide

First of all, the NVIDIA Container Toolkit (NVIDIA Docker) is required: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#docker.
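As an optional sanity check (not part of the original guide; the CUDA image tag below is only an example), you can verify the toolkit works by running nvidia-smi in a throwaway container:

# Should print the usual nvidia-smi GPU table if the toolkit is set up correctly
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi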

Step 1. Build Docker image

docker build -t mlc-perf:v0.1 .
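Run the build from the repository root, where the Dockerfile lives. If you want to confirm the image was created (an optional check, not part of the original steps), list it:

docker images mlc-perf:v0.1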

Step 2. Compile and run Llama2

First, log in to the docker container we created using the command below:

PORT=45678
MODELS=$HOME/models/

docker run            \
  -d -P               \
  --gpus all          \
  -h mlc-perf         \
  --name mlc-perf     \
  -p $PORT:22         \
  -v $MODELS:/models  \
  mlc-perf:v0.1
ssh root@0.0.0.0 -p $PORT # password: mlc_llm_perf

Note: allowing direct root login raises security concerns; it is used here only to keep this quick demo simple.
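Once logged in, an optional sanity check (not spelled out in the original steps) is to confirm the container actually sees the GPUs passed in via --gpus all:

# Inside the container: should list the RTX 3090 Ti (or whatever GPUs the host exposes)
nvidia-smi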

Then, compile the Llama2 model with MLC inside the docker container:

micromamba activate python311

cd $MLC_HOME
python build.py \
  --model /models/Llama-2/hf/Llama-2-7b-chat-hf \
  --target cuda \
  --quantization q4f16_1 \
  --artifact-path "./dist" \
  --use-cache 0

The quantized and compiled model will be exported to ./dist/Llama-2-7b-chat-hf-q4f16_1.
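If you want to inspect what was produced, you can list the artifact directory; the exact file layout depends on the MLC LLM version, so treat the comment below as an assumption rather than a guarantee:

# Typically holds the quantized weights, the compiled CUDA library, and a chat config
ls ./dist/Llama-2-7b-chat-hf-q4f16_1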

Finally, run the model and see the performance numbers:

$MLC_HOME/build/mlc_chat_cli \
  --model Llama-2-7b-chat-hf \
  --quantization q4f16_1
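To read the throughput, the mlc_chat_cli of this era prints runtime statistics via a /stats command typed at the chat prompt (assuming the CLI built into this image supports it; the original guide does not spell this out):

# Inside the interactive chat session, after at least one generated reply:
/stats    # reports prefill and decode speed in tok/s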

TODOs

- Only decoding performance is currently benchmarked, since prefilling usually takes much less time with flash attention.

- Currently, the MLC LLM numbers include a long system prompt, while the Exllama numbers use a fixed-length system prompt of 4 tokens, so the comparison is not exactly apples-to-apples. This should be fixed.
