NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, sparsity, and distillation. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.

Home Page: https://nvidia.github.io/TensorRT-Model-Optimizer


How to choose different alpha for mtq.INT8_SMOOTHQUANT_CFG?

siahuat0727 opened this issue · comments

Hi, I wonder whether it is possible to choose a different alpha for mtq.INT8_SMOOTHQUANT_CFG?

I found an example here and it works!

```python
quant_cfg["algorithm"] = {"method": "smoothquant", "alpha": 0.5}  # type: ignore[index]
```

But I noticed that setting alpha != 1 in SmoothQuant leads to different scales for qkv and some linear layers, which seems to prevent fusion with the previous norm layer. Shouldn't these layers have the same smooth scale for proper fusion?
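To make the concern above concrete, here is a minimal sketch of the SmoothQuant scale formula, s_j = max|X_j|^alpha / max|W_j|^(1-alpha) per input channel j. The toy matrices are illustrative, not taken from any real model: q/k/v read the same input X (the norm output) but have different weights W, so for alpha != 1 their smoothing scales differ, while at alpha == 1 the scale depends on X alone and the layers agree.

```python
# Sketch of the per-input-channel SmoothQuant scale:
#   s_j = max|X_j| ** alpha / max|W_j| ** (1 - alpha)
# Toy data only; shapes are (rows, input channels).

def col_amax(mat):
    # Column-wise max absolute value: one entry per input channel.
    return [max(abs(row[j]) for row in mat) for j in range(len(mat[0]))]

def smooth_scale(x, w, alpha):
    return [a ** alpha / b ** (1 - alpha)
            for a, b in zip(col_amax(x), col_amax(w))]

x = [[1.0, -2.0], [0.5, 4.0]]    # shared activations (the norm output)
w_q = [[0.3, -1.2], [2.0, 0.1]]  # q projection weight (toy)
w_k = [[1.5, 0.2], [-0.7, 0.9]]  # k projection weight (toy)

# alpha == 1: scale depends only on activations, so q and k agree.
assert smooth_scale(x, w_q, 1.0) == smooth_scale(x, w_k, 1.0)
# alpha == 0.5: weight statistics enter, so the scales diverge.
assert smooth_scale(x, w_q, 0.5) != smooth_scale(x, w_k, 0.5)
```

This is exactly why a shared scale is needed for folding the smoothing into the preceding norm layer: the norm can only apply one per-channel scale to its output.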

Is this a bug or am I misunderstanding something?

Thanks!

With alpha != 1, the q/k/v projections get different pre-quant scaling factors, and we run a postprocessing step to resmooth them, so this is not a bug.
The same also happens with AWQ.
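A minimal sketch of why such resmoothing is output-preserving: whichever single scale the q/k/v group ends up sharing, dividing the input by it and multiplying each weight column by it cancel exactly. The particular shared scale below is an illustrative assumption, not modelopt's actual resmoothing heuristic.

```python
# Toy demonstration: folding one shared per-channel scale into a
# weight while dividing it out of the input leaves the output unchanged.

def matvec(w, x):  # w: out x in, x: in
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

x = [1.0, -2.0, 0.5]                       # shared input (norm output)
w_q = [[0.3, -1.2, 0.7], [2.0, 0.1, -0.4]]  # one projection's weight (toy)
s_shared = [1.5, 0.8, 2.0]                 # assumed shared scale for the group

x_smooth = [xi / si for xi, si in zip(x, s_shared)]
w_q_resmoothed = [[wij * sj for wij, sj in zip(row, s_shared)] for row in w_q]

ref = matvec(w_q, x)
out = matvec(w_q_resmoothed, x_smooth)
assert all(abs(a - b) < 1e-9 for a, b in zip(ref, out))
```

Because the identity holds for any choice of shared scale, each of q, k, and v can be resmoothed onto one common scale without changing its output.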

Thanks! That clears things up regarding the rescaling for alpha != 1. Does modelopt handle the rescaling internally? Ideally, I'd love to see an example of how to grab those resmoothed scaling factors. @RalphMao

@siahuat0727 modelopt handles the rescaling internally during TensorRT-LLM checkpoint export.

There are no public examples that showcase this.