Request for Documentation of custom quantization algorithm / external quantized weight for AWQ
nuxlear opened this issue · comments
Hello, I have run into a problem: my model quantized with AWQ shows more performance degradation than I expected.
I know that ModelOpt provides optimized kernels and quantization algorithms for fast quantization,
but the guides and documentation focus on the custom forward loop at calibration time.
I could not find a way to apply a different quantization method while still using the existing kernels for fast inference.
Specifically, I want to load my own quantized weights that have the same structure as AWQ (scaling factors, and optionally shifts).
Is there any guide or documentation on supplying pre-quantized weights during the quantization process?
If not, is there any plan to implement this?
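For context, the weight layout I have in mind follows the usual AWQ convention: per-input-channel scaling factors applied before group-wise low-bit quantization. Below is a minimal NumPy sketch of that simulated (quantize-dequantize) scheme; the function name, the `group_size` default, and the `zeros` argument are my own assumptions for illustration, not ModelOpt's API:

```python
import numpy as np

def awq_style_fake_quant(weight, scales, group_size=128, n_bits=4, zeros=None):
    """Simulated AWQ-style quantization (illustrative sketch, not ModelOpt code).

    weight: (out_features, in_features) floating-point weight matrix
    scales: (in_features,) per-input-channel AWQ scaling factors
    zeros:  optional (out_features, in_features // group_size) zero points
    """
    out_f, in_f = weight.shape
    assert in_f % group_size == 0
    qmax = 2 ** n_bits - 1  # asymmetric INT4 grid: 0..15

    # AWQ scales salient input channels up before quantization
    w = weight * scales[None, :]

    # group-wise asymmetric min/max quantization
    w = w.reshape(out_f, in_f // group_size, group_size)
    w_min = w.min(axis=-1, keepdims=True)
    w_max = w.max(axis=-1, keepdims=True)
    step = (w_max - w_min) / qmax
    step = np.where(step == 0, 1.0, step)  # guard all-constant groups
    zp = np.round(-w_min / step) if zeros is None else zeros[..., None]

    q = np.clip(np.round(w / step) + zp, 0, qmax)  # quantize
    w_dq = (q - zp) * step                         # dequantize
    w_dq = w_dq.reshape(out_f, in_f)

    # undo the input-channel scaling so the layer's math is unchanged
    return w_dq / scales[None, :]
```

The scaling factors and per-group step/zero-point metadata are exactly the tensors I would like to supply from my own quantization pipeline instead of having ModelOpt compute them during calibration.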
Hi @nuxlear, ModelOpt does simulated quantization, and perf improvements need to be achieved with TensorRT-LLM. Are you observing perf degradation with ModelOpt or with TRT-LLM?
We will have initial support for quantized weights in the next several releases, but the goal is only memory saving, not speedup. For speedup, you should refer to the TensorRT-LLM deployment guide.
Thank you for replying, @RalphMao. By performance I meant English task performance, e.g., on the OpenLLM Leaderboard. I apologize for the confusion.
I use TRTLLM with Triton inference server right now, and I have experienced speedup and memory saving thanks to it.
My problem is that my AWQ-quantized model degrades more from FP16 on the OpenLLM benchmark than I expected.
It would be great if the initial support for quantized weights included customization options such as loading pre-quantized weights.
If I understood correctly, you are saying such features are coming soon, right?