NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.

Home Page: https://nvidia.github.io/TensorRT-Model-Optimizer

[Question] Does the quantized model run in full precision or int8 precision?

leeeizhang opened this issue

I used modelopt to quantize my model into an int8 ONNX model.
However, when I visualize the ONNX graph, I cannot tell whether it actually computes in full precision or in int8.

[Screenshot: the quantized ONNX graph, showing Q/DQ nodes feeding a MatMul]
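
For reference, here is a minimal sketch of how one could check what the exported graph contains, using the `onnx` package (the file name `model_int8.onnx` is a placeholder for my model):

```python
# Minimal sketch: count the QuantizeLinear/DequantizeLinear/MatMul nodes and the
# int8 initializers, to see what the graph itself expresses before TensorRT
# optimizes it. "model_int8.onnx" is a placeholder file name.
from collections import Counter

import onnx

model = onnx.load("model_int8.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)

print("QuantizeLinear nodes:  ", op_counts.get("QuantizeLinear", 0))
print("DequantizeLinear nodes:", op_counts.get("DequantizeLinear", 0))
print("MatMul nodes:          ", op_counts.get("MatMul", 0))

# Initializers stored with INT8 data type show the weights are kept as int8 in the file.
int8_inits = [init.name for init in model.graph.initializer
              if init.data_type == onnx.TensorProto.INT8]
print("int8 initializers:", len(int8_inits))
```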

It looks like the inputs and weights are quantized to int8 for storage in GPU memory, but right before the MatMul operations they are dequantized back to full precision (e.g., fp32) for the computation. Correct me if I am wrong.
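
To make the question concrete, here is a rough numpy sketch of what the Q/DQ -> MatMul subgraph computes as written (per-tensor symmetric int8, with made-up scales for illustration):

```python
# Sketch of the Q/DQ pattern as the graph expresses it: quantize to int8,
# dequantize back to fp32, then do the matmul in fp32 ("simulated quantization").
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)   # activation
w = rng.standard_normal((8, 16)).astype(np.float32)  # weight

def quantize(t, scale):
    return np.clip(np.round(t / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x_scale, w_scale = 0.05, 0.02                     # placeholder scales (calibration picks real ones)
x_q, w_q = quantize(x, x_scale), quantize(w, w_scale)  # int8 storage

y_qdq = dequantize(x_q, x_scale) @ dequantize(w_q, w_scale)  # what the ONNX graph encodes
y_ref = x @ w                                                # original fp32 matmul
print("max abs error vs fp32:", np.abs(y_qdq - y_ref).max())
```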

I also profiled the kernels at runtime; the kernel names are:

- `sm80_xmma_gemm_i8f32_i8i32_f32_tn_n_tilesize128x128x64_stage3_warpsize2x2x1_tensor16x8x32_execute_kernel_trt`
- `sm80_xmma_gemm_f32f32_tf32f32_f32_nn_n_tilesize64x128x16_stage4_warpsize2x2x1_tensor16x8x8_execute_kernel_trt`
- `sm80_xmma_gemm_i8f32_i8i32_f32_tn_n_tilesize128x128x64_stage3_warpsize2x2x1_tensor16x8x32_fused`

So what do `i8f32` and `i8i32` mean? Do they indicate that the int8 weights/inputs are converted to f32 or int32?

@leeeizhang your ONNX graph looks correct to me. TRT will recognize the Q/DQ layers before the MatMul and replace them with a quantized MatMul (i.e., in TRT the inputs and weights are quantized to int8, and the int8 GEMM is used for the MatMul).
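
If you want to confirm the fusion without decoding kernel names, one option is to build the engine and dump per-layer information with the engine inspector. A rough sketch (the ONNX file name is a placeholder, and flags/enums vary slightly across TensorRT versions):

```python
# Sketch: build an int8 engine from the quantized ONNX and print per-layer info,
# so the fused int8 GEMMs are visible without kernel-level profiling.
# Assumes a TensorRT 8.x-style Python API; "model_int8.onnx" is a placeholder.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
# EXPLICIT_BATCH is required on TensorRT 8.x; newer releases are always explicit-batch.
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model_int8.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # allow int8 kernels; the Q/DQ layers make the quantization explicit
config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED  # keep layer details in the engine

engine_bytes = builder.build_serialized_network(network, config)
engine = trt.Runtime(logger).deserialize_cuda_engine(engine_bytes)

inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
```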

LGTM! Many thanks!