A fully reproducible Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models
Check out our release blog to learn more.
Take a first glance at the Mistral 7B v0.1 Instruct and Llama 2 7B Chat performance metrics across different precisions and inference engines. Here is the run specification that generated these benchmark reports.
Environment:
- Model: Mistral 7B v0.1 Instruct / Llama 2 7B Chat
- GPU: A100 80GB
- CUDA Version: 12.1
- Batch size: 1
Command:
```bash
./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --model mistral/llama --prompt 'Write an essay about the transformer model architecture'
```
Model: Mistral 7B v0.1 Instruct

Performance Metrics: (unit: tokens/second)

| Engine | float32 | float16 | int8 | int4 |
|---|---|---|---|---|
| transformers (pytorch) | 39.61 ± 0.65 | 37.05 ± 0.49 | 5.08 ± 0.01 | 19.58 ± 0.38 |
| AutoAWQ | - | - | - | 63.12 ± 2.19 |
| AutoGPTQ | - | - | 39.11 ± 0.42 | 42.94 ± 0.80 |
| DeepSpeed | - | 79.88 ± 0.32 | - | - |
| ctransformers | - | - | 86.14 ± 1.40 | 87.22 ± 1.54 |
| llama.cpp | - | - | 88.27 ± 0.72 | 95.33 ± 5.54 |
| ctranslate | 43.17 ± 2.97 | 68.03 ± 0.27 | 45.14 ± 0.24 | - |
| PyTorch Lightning | 32.79 ± 2.74 | 43.01 ± 2.90 | 7.75 ± 0.12 | - |
| Nvidia TensorRT-LLM | 117.04 ± 2.16 | 206.59 ± 6.93 | 390.49 ± 4.86 | 427.40 ± 4.84 |
| vllm | 84.91 ± 0.27 | 84.89 ± 0.28 | - | 106.03 ± 0.53 |
| exllamav2 | - | - | 114.81 ± 1.47 | 126.29 ± 3.05 |
| onnx | 15.75 ± 0.15 | 22.39 ± 0.14 | - | - |
| Optimum Nvidia | 50.77 ± 0.85 | 50.91 ± 0.19 | - | - |
Performance Metrics: GPU Memory Consumption (unit: MB)

| Engine | float32 | float16 | int8 | int4 |
|---|---|---|---|---|
| transformers (pytorch) | 31071.4 | 15976.1 | 10963.91 | 5681.18 |
| AutoGPTQ | - | - | 13400.80 | 6633.29 |
| AutoAWQ | - | - | - | 6572.47 |
| DeepSpeed | - | 80097.34 | - | - |
| ctransformers | - | - | 10255.07 | 6966.74 |
| llama.cpp | - | - | 9141.49 | 5880.41 |
| ctranslate | 32602.32 | 17523.8 | 10074.72 | - |
| PyTorch Lightning | 48783.95 | 18738.05 | 10680.32 | - |
| Nvidia TensorRT-LLM | 79536.59 | 78341.21 | 77689.0 | 77311.51 |
| vllm | 73568.09 | 73790.39 | - | 74016.88 |
| exllamav2 | - | - | 21483.23 | 9460.25 |
| onnx | 33629.93 | 19537.07 | - | - |
| Optimum Nvidia | 79563.85 | 79496.74 | - | - |
*(Data updated: 30th April 2024)*
Model: Llama 2 7B Chat

Performance Metrics: (unit: tokens/second)

| Engine | float32 | float16 | int8 | int4 |
|---|---|---|---|---|
| transformers (pytorch) | 36.65 ± 0.61 | 34.20 ± 0.51 | 6.91 ± 0.14 | 17.83 ± 0.40 |
| AutoAWQ | - | - | - | 63.59 ± 1.86 |
| AutoGPTQ | - | - | 34.36 ± 0.51 | 36.63 ± 0.61 |
| DeepSpeed | - | 84.60 ± 0.25 | - | - |
| ctransformers | - | - | 85.50 ± 1.00 | 86.66 ± 1.06 |
| llama.cpp | - | - | 89.90 ± 2.26 | 97.35 ± 4.71 |
| ctranslate | 46.26 ± 1.59 | 79.41 ± 0.37 | 48.20 ± 0.14 | - |
| PyTorch Lightning | 38.01 ± 0.09 | 48.09 ± 1.12 | 10.68 ± 0.43 | - |
| Nvidia TensorRT-LLM | 104.07 ± 1.61 | 191.00 ± 4.60 | 316.77 ± 2.14 | 358.49 ± 2.38 |
| vllm | 89.40 ± 0.22 | 89.43 ± 0.19 | - | 115.52 ± 0.49 |
| exllamav2 | - | - | 125.58 ± 1.23 | 159.68 ± 1.85 |
| onnx | 14.28 ± 0.12 | 19.42 ± 0.08 | - | - |
| Optimum Nvidia | 53.64 ± 0.78 | 53.82 ± 0.11 | - | - |
Performance Metrics: GPU Memory Consumption (unit: MB)

| Engine | float32 | float16 | int8 | int4 |
|---|---|---|---|---|
| transformers (pytorch) | 29114.76 | 14931.72 | 8596.23 | 5643.44 |
| AutoAWQ | - | - | - | 7149.19 |
| AutoGPTQ | - | - | 10718.54 | 5706.35 |
| DeepSpeed | - | 80105.13 | - | - |
| ctransformers | - | - | 9774.83 | 6889.14 |
| llama.cpp | - | - | 8797.55 | 5783.95 |
| ctranslate | 29951.52 | 16282.29 | 9470.74 | - |
| PyTorch Lightning | 42748.35 | 14736.69 | 8028.16 | - |
| Nvidia TensorRT-LLM | 79421.24 | 78295.07 | 77642.86 | 77256.98 |
| vllm | 77928.07 | 77928.07 | - | 77768.69 |
| exllamav2 | - | - | 16582.18 | 7201.62 |
| onnx | 33072.09 | 19180.55 | - | - |
| Optimum Nvidia | 79429.63 | 79295.41 | - | - |
*(Data updated: 30th April 2024)*
Our latest version benchmarks Llama 2 7B Chat and Mistral 7B v0.1 Instruct, and it benchmarks only on an A100 80GB GPU, because our primary focus is enterprises. Previous versions benchmarked Llama 2 7B on CUDA and on Mac (M1/M2) CPU and Metal; you can find those results in the archive.md file. Please note that the engines are continuously maintained and improved, so those older numbers may be somewhat outdated.
There are several ML engines on the market today. Here is a quick glance at all the engines used for this benchmark, along with a summary of their support matrix. You can find the details about the nuances here.
| Engine | Float32 | Float16 | Int8 | Int4 | CUDA | ROCm | Mac M1/M2 | Training |
|---|---|---|---|---|---|---|---|---|
| candle | ✅ | ✅ | ❌ | 🚧 | ✅ | ❌ | 🚧 | ❌ |
| llama.cpp | ❌ | ❌ | ✅ | ✅ | ✅ | 🚧 | 🚧 | ❌ |
| ctranslate | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | 🚧 | ❌ |
| onnx | ✅ | ✅ | ❌ | ❌ | ✅ | ⚠️ | ❌ | ❌ |
| transformers (pytorch) | ✅ | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ |
| vllm | ✅ | ✅ | ❌ | ✅ | ✅ | 🚧 | ❌ | ❌ |
| exllamav2 | ❌ | ❌ | ✅ | ✅ | ✅ | 🚧 | ❌ | ❌ |
| ctransformers | ❌ | ❌ | ✅ | ✅ | ✅ | 🚧 | 🚧 | ❌ |
| AutoGPTQ | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| AutoAWQ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| DeepSpeed-MII | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ⚠️ |
| PyTorch Lightning | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | 🚧 | ✅ |
| Optimum Nvidia | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| Nvidia TensorRT-LLM | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
- ✅ Supported
- ❌ Not supported
- 🚧 Supported but not implemented in this current version
- ⚠️ There is a catch related to this
A common question: what benefits can you expect from this repository? Here are some quick pointers:

- Choosing an engine or precision for an LLM inference workflow can be confusing, because sometimes we have compute constraints and sometimes other requirements. This repository gives you a quick idea of what to use based on your requirements.
- There is often a quality-versus-speed tradeoff between engines and precisions. This repository keeps track of those tradeoffs so you can weigh them against your priorities.
- A fully reproducible and hackable script. The latest benchmarks follow best practices so they are robust enough to run on GPU devices, and you can reference and extend the implementations to build your own workflows.
Welcome to our benchmarking repository! This organized structure is designed to simplify benchmark management and execution. Each benchmark runs an inference engine that provides optimizations, either through quantization alone or through device-specific optimizations such as custom CUDA kernels.
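To make the quantization effect concrete, here is a quick back-of-the-envelope sketch (ours, not part of the benchmark suite): weight memory alone is roughly parameter count × bytes per parameter, which lines up with the per-precision GPU memory numbers reported above once runtime overhead is added.

```bash
# Rough weight-only memory for a 7B-parameter model at each precision.
# Actual usage is higher: activations, KV cache, and runtime overhead
# all add to the figures in the GPU memory tables above.
for bytes in 4 2 1 0.5; do   # float32, float16, int8, int4
  gb=$(echo "7 * $bytes" | bc)
  echo "$bytes bytes/param -> ~$gb GB of weights"
done
```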
To get started, you first need to download the models: Llama 2 7B Chat and Mistral 7B v0.1 Instruct. You can start the download with:

```bash
./download.sh
```

Please note that to download the Llama 2 7B Chat weights, we assume you have already agreed to the required terms and conditions and have been verified to download them.
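If you want to sanity-check the download afterwards, something like the following works; the models/ path is an assumption about where download.sh stores the weights, so adjust it to your checkout:

```bash
# List the downloaded model directories and their sizes (path is hypothetical).
du -sh models/*/
```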
When you run a benchmark, the following set of events occurs:

- The environment is set up automatically and the required dependencies are installed.
- The model is converted to a specific format (if required) and saved.
- The benchmark runs and the results are stored inside the logs folder. Each log folder has the following structure:
  - `performance.log`: tracks the model run performance. You can see the `token/sec` and `memory consumption (MB)` here (a parsing sketch follows this list).
  - `quality.md`: an automatically generated readme file containing qualitative comparisons of the different precisions of some engines. We take 5 prompts and run them for the set of supported precisions of that engine, then put those results side by side. Our ground truth is the output from the Hugging Face PyTorch model with raw float32 weights.
  - `quality.json`: the same as the readme file, but in a more raw format.
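The tables above report mean ± standard deviation over the repetitions. As a minimal sketch of how you might aggregate such numbers yourself, here is an awk one-liner; it assumes a hypothetical log layout in which each repetition's tokens/sec value is the last field on lines containing 'token/sec', and a hypothetical log path, so adapt both to the actual performance.log format:

```bash
# Hypothetical aggregation: mean ± std of tokens/sec entries in a run log.
grep 'token/sec' logs/llama_cuda/performance.log \
  | awk '{ x = $NF; s += x; ss += x * x; n++ }
         END { if (n) { m = s / n;
                        printf "%.2f ± %.2f tokens/sec over %d runs\n",
                               m, sqrt(ss / n - m * m), n } }'
```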
Inside each benchmark folder, you will also find a readme.md file containing all the information and the qualitative comparison for that engine. For example: bench_tensorrtllm.
Here is how we run benchmarks for an inference engine:

```bash
./bench_<engine-name>/bench.sh \
  --prompt <value> \                # Prompt string for the benchmark
  --max_tokens <value> \            # Maximum number of tokens to output
  --repetitions <value> \           # Number of repetitions to run for the prompt
  --device <cpu/cuda/metal> \       # The device to benchmark on
  --model_name <name-of-the-model>  # The model to benchmark (options: 'llama' for Llama 2 and 'mistral' for Mistral 7B v0.1)
```
Here is an example: say we want to benchmark Nvidia TensorRT-LLM. The command would look like this:

```bash
./bench_tensorrtllm/bench.sh -d cuda -n llama -r 10
```
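If you want to compare several engines in one go, a simple shell loop over the bench folders works; the folder names below other than bench_tensorrtllm are illustrative, so check the repository for the exact directory names:

```bash
# Hypothetical sweep: run the same benchmark across a few engines on CUDA.
for engine in tensorrtllm vllm llamacpp; do
  ./bench_"$engine"/bench.sh -d cuda -n llama -r 10
done
```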
For more detail, here is the description of each command-line argument:

```
-p,  --prompt        Prompt for benchmarks (default: 'Write an essay about the transformer model architecture')
-r,  --repetitions   Number of repetitions for benchmarks (default: 10)
-m,  --max_tokens    Maximum number of tokens for benchmarks (default: 512)
-d,  --device        Device for benchmarks (possible values: 'metal', 'cuda', and 'cpu', default: 'cuda')
-n,  --model_name    The name of the model to benchmark (possible values: 'llama' for Llama 2, 'mistral' for Mistral 7B v0.1)
-lf, --log_file      Logging file name
-h,  --help          Show this help message
```
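Putting it all together, a full invocation might look like the following; the max_tokens/repetitions values and the log file name are just examples:

```bash
./bench_tensorrtllm/bench.sh \
  --prompt 'Write an essay about the transformer model architecture' \
  --max_tokens 256 \
  --repetitions 5 \
  --device cuda \
  --model_name mistral \
  --log_file trtllm_mistral.log
```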
We welcome contributions to enhance and expand our benchmarking repository. If you'd like to contribute a new benchmark, follow these steps:
1. Create a New Folder

Start by creating a new folder for your benchmark. Name it `bench_{new_bench_name}` for consistency.

```bash
mkdir bench_{new_bench_name}
```
2. Folder Structure

Inside the new benchmark folder, include the following structure:

```
bench_{new_bench_name}
├── bench.sh          # Benchmark script for setup and execution
├── requirements.txt  # Dependencies required for the benchmark
└── ...               # Any additional files needed for the benchmark
```
3. Benchmark Script (`bench.sh`)

The `bench.sh` script should handle setup, environment configuration, and the actual execution of the benchmark. Ensure it supports the parameters mentioned in the Benchmark Script Parameters section. A minimal skeleton is sketched after this step.
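Here is a minimal, hedged skeleton for a new bench.sh. It only parses the documented flags and leaves the engine-specific steps as placeholders, so treat it as a starting point rather than the repository's canonical template:

```bash
#!/bin/bash
# Hypothetical skeleton for bench_{new_bench_name}/bench.sh.
# Parses the documented flags; engine-specific logic is left as TODOs.
set -euo pipefail

PROMPT='Write an essay about the transformer model architecture'
REPETITIONS=10
MAX_TOKENS=512
DEVICE=cuda
MODEL_NAME=llama

while [ $# -gt 0 ]; do
  case "$1" in
    -p|--prompt)      PROMPT="$2";      shift 2 ;;
    -r|--repetitions) REPETITIONS="$2"; shift 2 ;;
    -m|--max_tokens)  MAX_TOKENS="$2";  shift 2 ;;
    -d|--device)      DEVICE="$2";      shift 2 ;;
    -n|--model_name)  MODEL_NAME="$2";  shift 2 ;;
    -h|--help)        echo "Usage: $0 [-p prompt] [-r reps] [-m max_tokens] [-d device] [-n model]"; exit 0 ;;
    *) echo "Unknown option: $1" >&2; exit 1 ;;
  esac
done

# TODO: set up the environment and install the requirements.txt dependencies.
# TODO: convert the model weights if the engine needs a specific format.
# TODO: run the benchmark $REPETITIONS times and write results to the log folder.
echo "Benchmarking $MODEL_NAME on $DEVICE for $REPETITIONS runs (max_tokens=$MAX_TOKENS)"
```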
We use pre-commit hooks to maintain code quality and consistency.
1. Install Pre-commit: Ensure you have `pre-commit` installed:

```bash
pip install pre-commit
```
2. Install Hooks: Run the following command to install the pre-commit hooks:

```bash
pre-commit install
```
The existing pre-commit configuration will be used for automatic checks before each commit, ensuring code quality and adherence to defined standards.
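You can also run all configured hooks against the whole repository once, which is handy before opening a pull request:

```bash
# Run every configured hook across all files, not just staged changes.
pre-commit run --all-files
```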