wuzhiping / mlc-llm


Notes:

  docker build -t shawoo/mlc-llm .
  Prebuilt wheels (a network proxy, i.e. "magic", may be needed to access them): https://mlc.ai/wheels
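
  The wheels page is a pip find-links index. As a rough sketch (the exact package name and CUDA variant depend on your platform; mlc-ai-nightly is just one example), installation looks like:

  pip install --pre -f https://mlc.ai/wheels mlc-ai-nightly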

LLM Performance Benchmarking

Performance

Model        GPU           MLC LLM (tok/sec)   Exllama (tok/sec)   Llama.cpp (tok/sec)
Llama2-7B    RTX 3090 Ti   166.7               112.72              113.34
Llama2-13B   RTX 3090 Ti   99.2                69.31               71.34
Llama2-7B    RTX 4090      191.0               152.56              50.13
Llama2-13B   RTX 4090      108.8               93.88               36.81

All experiments use int4-quantized weights with fp16 activations and compute.

Commit:

Instructions

First of all, the NVIDIA Container Toolkit (NVIDIA Docker) is required: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#docker.
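
To verify that Docker can see the GPUs, a quick smoke test (assuming the CUDA 12.1 base image tag below is available) is:

docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi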

MLC LLM

Step 1. Build Docker image

docker build -t llm-perf-mlc:v0.1 -f Dockerfile.cu121.mlc .

Step 2. Quantize and compile Llama2. Start the container and log in to it using the commands below:

PORT=45678
MODELS=/PATH/TO/MODEL/ # Replace with the path to your HuggingFace models

docker run            \
  -d -P               \
  --gpus all          \
  -h llm-perf         \
  --name llm-perf     \
  -p $PORT:22         \
  -v $MODELS:/models  \
  llm-perf-mlc:v0.1

# Password is: llm_perf
ssh root@0.0.0.0 -p $PORT

# Inside the container, run the following commands:
micromamba activate python311

cd $MLC_HOME
# Replace /models/Llama-2-7b-chat-hf with the path to your HuggingFace model
python build.py                       \
  --model /models/Llama-2-7b-chat-hf  \
  --target cuda                       \
  --quantization q4f16_1              \
  --artifact-path "./dist"            \
  --use-cache 0

The quantized and compiled model will be exported to ./dist/Llama-2-7b-chat-hf-q4f16_1.
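
To confirm the artifacts were produced, list the output directory; the exact file names depend on the MLC commit, but it should contain a compiled model library along with a params directory holding the quantized weights and chat config:

ls ./dist/Llama-2-7b-chat-hf-q4f16_1/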

Step 3. Run the CLI tool to see the performance numbers:

$MLC_HOME/build/mlc_chat_cli \
  --model Llama-2-7b-chat-hf \
  --quantization q4f16_1
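
After a response has been generated, the CLI can report throughput directly, assuming the mlc_chat_cli build in this image supports the stats command:

# Type this at the chat prompt once a generation has finished:
/stats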

Exllama

TBD

Llama.cpp

Step 1. Build Docker image

docker build -t llm-perf-llama-cpp:v0.1 -f Dockerfile.cu121.llama_cpp .

Step 2. Download the quantized GGML models and run Llama2 via llama.cpp.

To obtain the quantized GGML models, it is recommended to download them from HuggingFace using the commands below:

wget https://huggingface.co/TheBloke/Llama-2-7B-GGML/resolve/main/llama-2-7b.ggmlv3.q4_K_M.bin
wget https://huggingface.co/TheBloke/Llama-2-13B-GGML/resolve/main/llama-2-13b.ggmlv3.q4_K_M.bin
PORT=41514
GGML_BINS=/PATH/TO/GGML_BINS/  # Replace with the path to the downloaded GGML files

docker run                  \
  -d -P                     \
  --gpus all                \
  -h llm-perf               \
  --name llm-perf-llama-cpp \
  -p $PORT:22               \
  -v $GGML_BINS:/ggml_bins  \
  llm-perf-llama-cpp:v0.1

# Password is: llm_perf
ssh root@0.0.0.0 -p $PORT

Step 3. Run the CLI tool to see the performance numbers.

Log in to the container as in Step 2, then run the commands below:

cd $LLAMA_CPP_HOME
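# Flags: -n 128 generates 128 new tokens, -ngl 999 offloads all model layers to the GPU,
# and --ignore-eos keeps generating past any end-of-sequence token so the full 128 tokens are timed.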
./build/bin/main -m /ggml_bins/llama-2-7b.ggmlv3.q4_K_M.bin -p "Please generate a very long story about wizard and technology, at least two thousand words" -n 128 -ngl 999 --ignore-eos
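
When generation finishes, llama.cpp prints a llama_print_timings summary; the decoding speed in tokens per second is reported on the eval time line.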

TODOs

Only decoding performance is currently benchmarked, since prefilling usually takes much less time with flash attention.

Currently, the MLC LLM numbers include a long system prompt, while the Exllama numbers use a fixed-length system prompt of 4 tokens, so the comparison is not exactly apples-to-apples. This should be fixed.
