- [2024/07/04] Added support for evaluation with the vLLM backend using lm-evaluation-harness.
- [2024/06/21] Added support for inference performance benchmark with LMDeploy and vLLM.
- [2024/06/14] Added support for inference performance benchmark with TensorRT-LLM.
- [2024/06/14] We officially released LLM-Benchmarks!
LLM-Benchmarks is an easy-to-use toolbox for benchmarking Large Language Models (LLMs) on inference performance and task evaluation.
- Inference Performance: Benchmark LLM services deployed with inference frameworks (e.g., TensorRT-LLM, LMDeploy, and vLLM) under different batch sizes and generation lengths.
- Task Evaluation: Few-shot evaluation of LLMs through APIs, including OpenAI and Triton Inference Server, with lm-evaluation-harness.
You can download the dataset by running:

```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
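As a quick sanity check, you can count the conversations in the downloaded file. This is only an illustrative snippet; it assumes the file name from the command above and that the JSON is a top-level list of conversation records.

```bash
# Count conversations in the ShareGPT dump (assumes a top-level JSON list).
python3 -c "
import json
with open('ShareGPT_V3_unfiltered_cleaned_split.json') as f:
    data = json.load(f)
print(f'{len(data)} conversations loaded')
"
```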
You can build docker images by running:

```bash
# for tensorrt-llm
bash scripts/trt_llm/build_docker.sh all
# for lmdeploy
bash scripts/lmdeploy/build_docker.sh
# for vllm
bash scripts/vllm/build_docker.sh
```
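After the builds finish, you can confirm the images are available locally. The name filters below are only a guess at how the images are tagged; adjust them to match the tags printed by the build scripts.

```bash
# List locally built images; the grep patterns are assumptions, not the
# exact tags produced by the build scripts.
docker images | grep -iE "trt|lmdeploy|vllm"
```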
- Inference Performance

```bash
# device_id can be a single GPU (e.g., 0) or a comma-separated list (e.g., 0,1)
bash run_benchmark.sh <model_path> <dataset_path> <sample_num> <device_id>
```
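For example, to benchmark a model with 1000 sampled requests on GPUs 0 and 1 (the model path and sample count below are illustrative placeholders, not values required by the script):

```bash
# Example invocation; the model path and sample count are placeholders.
bash run_benchmark.sh /models/Llama-2-7B-hf ShareGPT_V3_unfiltered_cleaned_split.json 1000 0,1
```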
- Task Evaluation

```bash
# Build the evaluation image
bash scripts/evaluation/build_docker.sh vllm  # (or lmdeploy or trt-llm)
# Evaluation with the vLLM backend
bash run_eval.sh <mode> <model_path> <device_id>  # mode: fp16, fp8-kv-fp16, or fp8-kv-fp8; device_id like 0 or 0,1
```
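For instance, to evaluate an fp16 model on GPU 0 with the vLLM backend (the model path below is an illustrative placeholder):

```bash
# Example invocation; the model path is a placeholder.
bash run_eval.sh fp16 /models/Llama-2-7B-hf 0
```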