NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.

Home Page: https://nvidia.github.io/TensorRT-Model-Optimizer

Request for Documentation of custom quantization algorithm / external quantized weight for AWQ

nuxlear opened this issue

Hello, I am having trouble with an AWQ-quantized model that shows more performance degradation than expected.

I know that ModelOpt provides optimized kernels and quantization algorithms for fast quantization,
but the guide and documentation focus on the custom forward loop used at calibration time.
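For context, my calibration flow is essentially the standard one from the ModelOpt docs, sketched below; `calib_dataloader` is my own data loader, and the AWQ config name is just the one I chose for my model:

```python
import modelopt.torch.quantization as mtq

# Standard ModelOpt PTQ flow: pick a quantization config and run a
# user-supplied forward loop over calibration data.
config = mtq.INT4_AWQ_CFG  # 4-bit AWQ weight-quantization config

def forward_loop(model):
    # Run calibration batches through the model so ModelOpt can
    # collect statistics. `calib_dataloader` is my own dataloader,
    # not part of ModelOpt.
    for batch in calib_dataloader:
        model(batch)

# Quantize (simulated/fake quantization) in place.
model = mtq.quantize(model, config, forward_loop)
```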

I could not find a way to apply a different quantization method while still using the existing kernels for fast inference.
Specifically, I want to load my own quantized weights that have the same structure as AWQ's (scaling factors, and optionally shifts).
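To illustrate, here is the kind of thing I am hoping to do. This is purely hypothetical code: I am guessing at the `weight_quantizer` / `amax` attribute names from ModelOpt's `TensorQuantizer`, and I do not know whether overwriting them like this is actually supported:

```python
import torch

# Hypothetical: externally computed AWQ-style scaling factors,
# keyed by module name. Not a ModelOpt API.
external_scales = torch.load("my_awq_scales.pt")

for name, module in model.named_modules():
    # Guessing that quantized linear modules expose a `weight_quantizer`
    # whose `amax` holds the calibrated scale range; this is exactly the
    # part I could not find documented.
    if hasattr(module, "weight_quantizer") and name in external_scales:
        module.weight_quantizer.amax = external_scales[name]
```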

Is there any guide or document on supplying pre-quantized weights during the quantization process?
If not, is there any plan to implement this?

Hi @nuxlear, ModelOpt does simulated quantization, and perf improvement needs to be achieved with TensorRT-LLM. Are you observing perf degradation with ModelOpt or with TRTLLM?

We will have initial support for quantized weights in the next several releases, but the goal is just memory saving, not speedup. For speedup, you should refer to the TRTLLM deployment guide.
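Roughly, the intended path is: calibrate with ModelOpt, export a checkpoint, and let TRTLLM build the engine that runs the real quantized kernels. Something like the sketch below (the `decoder_type` and output path are placeholders you would set for your model):

```python
import torch
from modelopt.torch.export import export_tensorrt_llm_checkpoint

# After mtq.quantize(...) has calibrated the model, export a
# TensorRT-LLM checkpoint; TRTLLM then builds the engine with the
# actual quantized kernels, which is where the speedup comes from.
with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,
        decoder_type="llama",           # placeholder: your model family
        dtype=torch.float16,
        export_dir="/tmp/trtllm_ckpt",  # placeholder output path
    )
```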

Thank you for replying, @RalphMao. By performance I meant English-task performance, such as on the OpenLLM leaderboard; I apologize for the confusion.
I currently use TRTLLM with the Triton inference server, and I have seen both speedup and memory savings thanks to it.
My problem is that my AWQ-quantized model degrades more from FP16 on the OpenLLM benchmark than I expected.

It would be great if the initial support for quantized weights included customization options such as loading pre-quantized weights.
As I understand it, you are saying such features will arrive soon, right?