A fully reproducible Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models
Check out our release blog to learn more.
Take a first glance at the Mistral 7B v0.1 Instruct and Llama 2 7B Chat performance metrics across different precisions and inference engines. Here is the run specification that generated these benchmark reports.
Environment:
- Model: Mistral 7B v0.1 Instruct / Llama 2 7B Chat
- GPU: A100 80GB
- CUDA Version: 12.1
- Batch size: 1
Command:
```bash
./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --model mistral/llama --prompt 'Write an essay about the transformer model architecture'
```
Model: Mistral 7B v0.1 Instruct

Performance Metrics: (unit: tokens/second)

| Engine | float32 | float16 | int8 | int4 |
|---|---|---|---|---|
| transformers (pytorch) | 39.61 ± 0.65 | 37.05 ± 0.49 | 5.08 ± 0.01 | 19.58 ± 0.38 |
| AutoAWQ | - | - | - | 63.12 ± 2.19 |
| AutoGPTQ | - | - | 39.11 ± 0.42 | 42.94 ± 0.80 |
| DeepSpeed | - | 79.88 ± 0.32 | - | - |
| ctransformers | - | - | 86.14 ± 1.40 | 87.22 ± 1.54 |
| llama.cpp | - | - | 88.27 ± 0.72 | 95.33 ± 5.54 |
| ctranslate | 43.17 ± 2.97 | 68.03 ± 0.27 | 45.14 ± 0.24 | - |
| PyTorch Lightning | 32.79 ± 2.74 | 43.01 ± 2.90 | 7.75 ± 0.12 | - |
| Nvidia TensorRT-LLM | 117.04 ± 2.16 | 206.59 ± 6.93 | 390.49 ± 4.86 | 427.40 ± 4.84 |
| vllm | 84.91 ± 0.27 | 84.89 ± 0.28 | - | 106.03 ± 0.53 |
| exllamav2 | - | - | 114.81 ± 1.47 | 126.29 ± 3.05 |
| onnx | 15.75 ± 0.15 | 22.39 ± 0.14 | - | - |
| Optimum Nvidia | 50.77 ± 0.85 | 50.91 ± 0.19 | - | - |
Performance Metrics: GPU Memory Consumption (unit: MB)

| Engine | float32 | float16 | int8 | int4 |
|---|---|---|---|---|
| transformers (pytorch) | 31071.4 | 15976.1 | 10963.91 | 5681.18 |
| AutoGPTQ | - | - | 13400.80 | 6633.29 |
| AutoAWQ | - | - | - | 6572.47 |
| DeepSpeed | - | 80097.34 | - | - |
| ctransformers | - | - | 10255.07 | 6966.74 |
| llama.cpp | - | - | 9141.49 | 5880.41 |
| ctranslate | 32602.32 | 17523.8 | 10074.72 | - |
| PyTorch Lightning | 48783.95 | 18738.05 | 10680.32 | - |
| Nvidia TensorRT-LLM | 79536.59 | 78341.21 | 77689.0 | 77311.51 |
| vllm | 73568.09 | 73790.39 | - | 74016.88 |
| exllamav2 | - | - | 21483.23 | 9460.25 |
| onnx | 33629.93 | 19537.07 | - | - |
| Optimum Nvidia | 79563.85 | 79496.74 | - | - |
*(Data updated: 30th April 2024)*
Model: Llama 2 7B Chat

Performance Metrics: (unit: tokens/second)

| Engine | float32 | float16 | int8 | int4 |
|---|---|---|---|---|
| transformers (pytorch) | 36.65 ± 0.61 | 34.20 ± 0.51 | 6.91 ± 0.14 | 17.83 ± 0.40 |
| AutoAWQ | - | - | - | 63.59 ± 1.86 |
| AutoGPTQ | - | - | 34.36 ± 0.51 | 36.63 ± 0.61 |
| DeepSpeed | - | 84.60 ± 0.25 | - | - |
| ctransformers | - | - | 85.50 ± 1.00 | 86.66 ± 1.06 |
| llama.cpp | - | - | 89.90 ± 2.26 | 97.35 ± 4.71 |
| ctranslate | 46.26 ± 1.59 | 79.41 ± 0.37 | 48.20 ± 0.14 | - |
| PyTorch Lightning | 38.01 ± 0.09 | 48.09 ± 1.12 | 10.68 ± 0.43 | - |
| Nvidia TensorRT-LLM | 104.07 ± 1.61 | 191.00 ± 4.60 | 316.77 ± 2.14 | 358.49 ± 2.38 |
| vllm | 89.40 ± 0.22 | 89.43 ± 0.19 | - | 115.52 ± 0.49 |
| exllamav2 | - | - | 125.58 ± 1.23 | 159.68 ± 1.85 |
| onnx | 14.28 ± 0.12 | 19.42 ± 0.08 | - | - |
| Optimum Nvidia | 53.64 ± 0.78 | 53.82 ± 0.11 | - | - |
Performance Metrics: GPU Memory Consumption (unit: MB)

| Engine | float32 | float16 | int8 | int4 |
|---|---|---|---|---|
| transformers (pytorch) | 29114.76 | 14931.72 | 8596.23 | 5643.44 |
| AutoAWQ | - | - | - | 7149.19 |
| AutoGPTQ | - | - | 10718.54 | 5706.35 |
| DeepSpeed | - | 80105.13 | - | - |
| ctransformers | - | - | 9774.83 | 6889.14 |
| llama.cpp | - | - | 8797.55 | 5783.95 |
| ctranslate | 29951.52 | 16282.29 | 9470.74 | - |
| PyTorch Lightning | 42748.35 | 14736.69 | 8028.16 | - |
| Nvidia TensorRT-LLM | 79421.24 | 78295.07 | 77642.86 | 77256.98 |
| vllm | 77928.07 | 77928.07 | - | 77768.69 |
| exllamav2 | - | - | 16582.18 | 7201.62 |
| onnx | 33072.09 | 19180.55 | - | - |
| Optimum Nvidia | 79429.63 | 79295.41 | - | - |
*(Data updated: 30th April 2024)*
Our latest version benchmarks Llama 2 7B Chat and Mistral 7B v0.1 Instruct, and it benchmarks only on an A100 80GB GPU, because our primary focus is enterprises. Previous versions benchmarked Llama 2 7B on CUDA and on Mac (M1/M2) CPU and Metal; you can find those results in the archive.md file. Please note that the engines are continuously maintained and improved, so those older numbers may be somewhat outdated.
There are several ML engines on the market today. Here is a quick glance at all the engines used for this benchmark, along with a summary of their support matrix. You can find the details about the nuances here.
| Engine | Float32 | Float16 | Int8 | Int4 | CUDA | ROCm | Mac M1/M2 | Training |
|---|---|---|---|---|---|---|---|---|
| candle | ✅ | ✅ | ❌ | 🚧 | ✅ | ❌ | 🚧 | ❌ |
| llama.cpp | ❌ | ❌ | ✅ | ✅ | ✅ | 🚧 | 🚧 | ❌ |
| ctranslate | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | 🚧 | ❌ |
| onnx | ✅ | ✅ | ❌ | ❌ | ✅ | ⚠️ | ❌ | ❌ |
| transformers (pytorch) | ✅ | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ |
| vllm | ✅ | ✅ | ❌ | ✅ | ✅ | 🚧 | ❌ | ❌ |
| exllamav2 | ❌ | ❌ | ✅ | ✅ | ✅ | 🚧 | ❌ | ❌ |
| ctransformers | ❌ | ❌ | ✅ | ✅ | ✅ | 🚧 | 🚧 | ❌ |
| AutoGPTQ | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| AutoAWQ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| DeepSpeed-MII | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ⚠️ |
| PyTorch Lightning | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | 🚧 | ✅ |
| Optimum Nvidia | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
| Nvidia TensorRT-LLM | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
- ✅ Supported
- ❌ Not supported
- 🚧 Supported but not implemented in this current version
- ⚠️ There is a catch related to this
A common question: what benefits can you expect from this repository? Here are some quick pointers:

- Choosing an engine or precision for an LLM inference workflow can be confusing, because sometimes we have compute constraints and sometimes other requirements. This repository gives you a quick idea of what to use based on your requirements.
- There is often a quality-versus-speed tradeoff between engines and precisions. This repository keeps track of those tradeoffs so you can weigh them against your priorities.
- A fully reproducible and hackable script. The latest benchmarks follow best practices so they are robust enough to run on GPU devices, and you can reference and extend the implementations to build your own workflows.
Welcome to our benchmarking repository! This organized structure is designed to simplify benchmark management and execution. Each benchmark runs an inference engine that provides optimizations, either through quantization alone or through device-specific optimizations such as custom CUDA kernels.
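To make the quantization effect concrete, here is a quick back-of-the-envelope sketch (ours, not part of the benchmark suite): weight memory alone is roughly parameter count × bytes per parameter, which lines up with the per-precision GPU memory numbers reported above once runtime overhead is added.

```bash
# Rough weight-only memory for a 7B-parameter model at each precision.
# Actual usage is higher: activations, KV cache, and runtime overhead
# all add to the figures in the GPU memory tables above.
for bytes in 4 2 1 0.5; do   # float32, float16, int8, int4
  gb=$(echo "7 * $bytes" | bc)
  echo "$bytes bytes/param -> ~$gb GB of weights"
done
```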
To get started, you first need to download the models: Llama 2 7B Chat and Mistral 7B v0.1 Instruct. You can start the download with:

```bash
./download.sh
```

Please note that to download the Llama 2 7B Chat weights, we assume you have already agreed to the required terms and conditions and have been verified to download them.
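If you want to sanity-check the download afterwards, something like the following works; the models/ path is an assumption about where download.sh stores the weights, so adjust it to your checkout:

```bash
# List the downloaded model directories and their sizes (path is hypothetical).
du -sh models/*/
```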
When you run a benchmark, the following set of events occurs:

- The environment is set up automatically and the required dependencies are installed.
- The model is converted to a specific format (if required) and saved.
- The benchmark runs and the results are stored inside the logs folder. Each log folder has the following structure:
  - `performance.log`: tracks the model run performance. You can see the `token/sec` and `memory consumption (MB)` here (a parsing sketch follows this list).
  - `quality.md`: an automatically generated readme file containing qualitative comparisons of the different precisions of some engines. We take 5 prompts and run them for the set of supported precisions of that engine, then put those results side by side. Our ground truth is the output from the Hugging Face PyTorch model with raw float32 weights.
  - `quality.json`: the same as the readme file, but in a more raw format.
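The tables above report mean ± standard deviation over the repetitions. As a minimal sketch of how you might aggregate such numbers yourself, here is an awk one-liner; it assumes a hypothetical log layout in which each repetition's tokens/sec value is the last field on lines containing 'token/sec', and a hypothetical log path, so adapt both to the actual performance.log format:

```bash
# Hypothetical aggregation: mean ± std of tokens/sec entries in a run log.
grep 'token/sec' logs/llama_cuda/performance.log \
  | awk '{ x = $NF; s += x; ss += x * x; n++ }
         END { if (n) { m = s / n;
                        printf "%.2f ± %.2f tokens/sec over %d runs\n",
                               m, sqrt(ss / n - m * m), n } }'
```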
Inside each benchmark folder, you will also find a readme.md file containing all the information and the qualitative comparison for that engine. For example: bench_tensorrtllm.
Here is how we run benchmarks for an inference engine:

```bash
./bench_<engine-name>/bench.sh \
  --prompt <value> \                # Prompt string for the benchmark
  --max_tokens <value> \            # Maximum number of tokens to output
  --repetitions <value> \           # Number of repetitions to run for the prompt
  --device <cpu/cuda/metal> \       # The device to benchmark on
  --model_name <name-of-the-model>  # The model to benchmark (options: 'llama' for Llama 2 and 'mistral' for Mistral 7B v0.1)
```
Here is an example: say we want to benchmark Nvidia TensorRT-LLM. The command would look like this:

```bash
./bench_tensorrtllm/bench.sh -d cuda -n llama -r 10
```
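If you want to compare several engines in one go, a simple shell loop over the bench folders works; the folder names below other than bench_tensorrtllm are illustrative, so check the repository for the exact directory names:

```bash
# Hypothetical sweep: run the same benchmark across a few engines on CUDA.
for engine in tensorrtllm vllm llamacpp; do
  ./bench_"$engine"/bench.sh -d cuda -n llama -r 10
done
```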
For more detail, here is the description of each command-line argument:

```
-p,  --prompt        Prompt for benchmarks (default: 'Write an essay about the transformer model architecture')
-r,  --repetitions   Number of repetitions for benchmarks (default: 10)
-m,  --max_tokens    Maximum number of tokens for benchmarks (default: 512)
-d,  --device        Device for benchmarks (possible values: 'metal', 'cuda', and 'cpu', default: 'cuda')
-n,  --model_name    The name of the model to benchmark (possible values: 'llama' for Llama 2, 'mistral' for Mistral 7B v0.1)
-lf, --log_file      Logging file name
-h,  --help          Show this help message
```
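Putting it all together, a full invocation might look like the following; the max_tokens/repetitions values and the log file name are just examples:

```bash
./bench_tensorrtllm/bench.sh \
  --prompt 'Write an essay about the transformer model architecture' \
  --max_tokens 256 \
  --repetitions 5 \
  --device cuda \
  --model_name mistral \
  --log_file trtllm_mistral.log
```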
We welcome contributions to enhance and expand our benchmarking repository. If you'd like to contribute a new benchmark, follow these steps:
1. Create a New Folder

Start by creating a new folder for your benchmark. Name it `bench_{new_bench_name}` for consistency.

```bash
mkdir bench_{new_bench_name}
```
2. Folder Structure

Inside the new benchmark folder, include the following structure:

```
bench_{new_bench_name}
├── bench.sh          # Benchmark script for setup and execution
├── requirements.txt  # Dependencies required for the benchmark
└── ...               # Any additional files needed for the benchmark
```
3. Benchmark Script (`bench.sh`)

The `bench.sh` script should handle setup, environment configuration, and the actual execution of the benchmark. Ensure it supports the parameters mentioned in the Benchmark Script Parameters section. A minimal skeleton is sketched after this step.
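Here is a minimal, hedged skeleton for a new bench.sh. It only parses the documented flags and leaves the engine-specific steps as placeholders, so treat it as a starting point rather than the repository's canonical template:

```bash
#!/bin/bash
# Hypothetical skeleton for bench_{new_bench_name}/bench.sh.
# Parses the documented flags; engine-specific logic is left as TODOs.
set -euo pipefail

PROMPT='Write an essay about the transformer model architecture'
REPETITIONS=10
MAX_TOKENS=512
DEVICE=cuda
MODEL_NAME=llama

while [ $# -gt 0 ]; do
  case "$1" in
    -p|--prompt)      PROMPT="$2";      shift 2 ;;
    -r|--repetitions) REPETITIONS="$2"; shift 2 ;;
    -m|--max_tokens)  MAX_TOKENS="$2";  shift 2 ;;
    -d|--device)      DEVICE="$2";      shift 2 ;;
    -n|--model_name)  MODEL_NAME="$2";  shift 2 ;;
    -h|--help)        echo "Usage: $0 [-p prompt] [-r reps] [-m max_tokens] [-d device] [-n model]"; exit 0 ;;
    *) echo "Unknown option: $1" >&2; exit 1 ;;
  esac
done

# TODO: set up the environment and install the requirements.txt dependencies.
# TODO: convert the model weights if the engine needs a specific format.
# TODO: run the benchmark $REPETITIONS times and write results to the log folder.
echo "Benchmarking $MODEL_NAME on $DEVICE for $REPETITIONS runs (max_tokens=$MAX_TOKENS)"
```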
We use pre-commit hooks to maintain code quality and consistency.
1. Install Pre-commit: Ensure you have `pre-commit` installed:

```bash
pip install pre-commit
```
2. Install Hooks: Run the following command to install the pre-commit hooks:

```bash
pre-commit install
```
The existing pre-commit configuration will be used for automatic checks before each commit, ensuring code quality and adherence to defined standards.
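You can also run all configured hooks against the whole repository once, which is handy before opening a pull request:

```bash
# Run every configured hook across all files, not just staged changes.
pre-commit run --all-files
```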