ninehills / llm-inference-benchmark

LLM Inference benchmark


Why is the inference FTL@1 longer after quantization with the vLLM framework?

luhairong11 opened this issue

vLLM has already fixed this issue.

I will retest soon.

@ninehills
Is there any update on this? Or could you tell me in which version of vLLM this issue was resolved?

In vLLM 0.4.3, my tests show that the quantized version's TTFT is still lower than the non-quantized version's.
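
For anyone who wants to reproduce this comparison, below is a minimal TTFT measurement sketch against two running vLLM OpenAI-compatible servers (one fp16, one AWQ-quantized). The base URLs, model names, and prompt are placeholder assumptions for illustration, not the exact setup used in this issue.

```python
# Minimal TTFT comparison sketch against two vLLM OpenAI-compatible servers.
# Assumptions: fp16 model served on :8000, AWQ-quantized model on :8001;
# model names and prompt are placeholders -- adjust to your own deployment.
import time
from openai import OpenAI


def measure_ttft(base_url: str, model: str, prompt: str) -> float:
    """Return time-to-first-token (seconds) for one streamed completion."""
    client = OpenAI(base_url=base_url, api_key="EMPTY")
    start = time.perf_counter()
    stream = client.completions.create(
        model=model,
        prompt=prompt,
        max_tokens=64,
        stream=True,
    )
    for chunk in stream:
        # The first chunk that carries generated text marks the first token.
        if chunk.choices and chunk.choices[0].text:
            return time.perf_counter() - start
    return float("nan")


if __name__ == "__main__":
    prompt = "Explain the difference between throughput and latency."
    for label, url, model in [
        ("fp16", "http://localhost:8000/v1", "Qwen/Qwen2-7B-Instruct"),
        ("awq", "http://localhost:8001/v1", "Qwen/Qwen2-7B-Instruct-AWQ"),
    ]:
        ttft = measure_ttft(url, model, prompt)
        print(f"{label}: TTFT = {ttft * 1000:.1f} ms")
```

Running several warm-up requests first and averaging over multiple prompts would give a fairer comparison, since the first request after server start includes compilation and cache-warming overhead.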