NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.

Home Page: https://nvidia.github.io/TensorRT-Model-Optimizer

[Question] Does the quantized model run in full precision or int8 precision?

leeeizhang opened this issue

I used modelopt to quantize my model into an int8 ONNX model.
However, when I visualize the ONNX graph, I cannot tell whether it actually computes in full precision or in int8.

[Screenshot: the quantized ONNX graph, showing Q/DQ nodes feeding a MatMul]
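
For reference, here is a minimal sketch of how one could check what the exported graph contains, using the `onnx` package (the file name `model_int8.onnx` is a placeholder for my model):

```python
# Minimal sketch: count the QuantizeLinear/DequantizeLinear/MatMul nodes and the
# int8 initializers, to see what the graph itself expresses before TensorRT
# optimizes it. "model_int8.onnx" is a placeholder file name.
from collections import Counter

import onnx

model = onnx.load("model_int8.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)

print("QuantizeLinear nodes:  ", op_counts.get("QuantizeLinear", 0))
print("DequantizeLinear nodes:", op_counts.get("DequantizeLinear", 0))
print("MatMul nodes:          ", op_counts.get("MatMul", 0))

# Initializers stored with INT8 data type show the weights are kept as int8 in the file.
int8_inits = [init.name for init in model.graph.initializer
              if init.data_type == onnx.TensorProto.INT8]
print("int8 initializers:", len(int8_inits))
```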

It looks like the inputs and weights are quantized to int8 for storage in GPU memory, but right before the MatMul operations they are dequantized back to full precision (e.g., fp32) for the computation. Correct me if I am wrong.
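
To make the question concrete, here is a rough numpy sketch of what the Q/DQ -> MatMul subgraph computes as written (per-tensor symmetric int8, with made-up scales for illustration):

```python
# Sketch of the Q/DQ pattern as the graph expresses it: quantize to int8,
# dequantize back to fp32, then do the matmul in fp32 ("simulated quantization").
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)   # activation
w = rng.standard_normal((8, 16)).astype(np.float32)  # weight

def quantize(t, scale):
    return np.clip(np.round(t / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x_scale, w_scale = 0.05, 0.02                     # placeholder scales (calibration picks real ones)
x_q, w_q = quantize(x, x_scale), quantize(w, w_scale)  # int8 storage

y_qdq = dequantize(x_q, x_scale) @ dequantize(w_q, w_scale)  # what the ONNX graph encodes
y_ref = x @ w                                                # original fp32 matmul
print("max abs error vs fp32:", np.abs(y_qdq - y_ref).max())
```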

I also profiled the kernels at runtime; the kernel names are:

- `sm80_xmma_gemm_i8f32_i8i32_f32_tn_n_tilesize128x128x64_stage3_warpsize2x2x1_tensor16x8x32_execute_kernel_trt`
- `sm80_xmma_gemm_f32f32_tf32f32_f32_nn_n_tilesize64x128x16_stage4_warpsize2x2x1_tensor16x8x8_execute_kernel_trt`
- `sm80_xmma_gemm_i8f32_i8i32_f32_tn_n_tilesize128x128x64_stage3_warpsize2x2x1_tensor16x8x32_fused`

So what do `i8f32` and `i8i32` mean? Do they indicate that the int8 weights/inputs are converted to f32 or int32?

@leeeizhang your ONNX graph looks correct to me. TRT will recognize the Q/DQ layers before the MatMul and replace them with a quantized MatMul (i.e., in TRT the inputs and weights are quantized to int8, and the int8 GEMM is used for the MatMul).
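
If you want to confirm the fusion without decoding kernel names, one option is to build the engine and dump per-layer information with the engine inspector. A rough sketch (the ONNX file name is a placeholder, and flags/enums vary slightly across TensorRT versions):

```python
# Sketch: build an int8 engine from the quantized ONNX and print per-layer info,
# so the fused int8 GEMMs are visible without kernel-level profiling.
# Assumes a TensorRT 8.x-style Python API; "model_int8.onnx" is a placeholder.
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
# EXPLICIT_BATCH is required on TensorRT 8.x; newer releases are always explicit-batch.
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model_int8.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # allow int8 kernels; the Q/DQ layers make the quantization explicit
config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED  # keep layer details in the engine

engine_bytes = builder.build_serialized_network(network, config)
engine = trt.Runtime(logger).deserialize_cuda_engine(engine_bytes)

inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
```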

LGTM! Many thanks!