rmihaylov / falcontune

Tune any FALCON in 4-bit


Inference speed for 7B models (triton backend, RTX 3090)

nikshepsvn opened this issue:

I'm running the 7B model on a 3090, and the inference time for the prompt "How to prepare pasta?" is around 20-30 seconds. Is this expected?


The fast inference applies to the 4-bit GPTQ models. Also, the triton code is compiled during the first generation, which is why the real speedup only shows up from the second generation onwards.
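
If you want to benchmark this yourself, here is a minimal timing sketch that separates the warm-up call from the steady-state calls. It uses the plain transformers API as a stand-in (the model name and loading arguments below are assumptions, not falcontune's actual loading path), but the measurement pattern is the same:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in checkpoint; in practice you would load the 4-bit GPTQ
# weights through falcontune as described in the README.
MODEL = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

def timed_generate(prompt: str, max_new_tokens: int = 50) -> float:
    """Run one generation and return the wall-clock time in seconds."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens)
    return time.perf_counter() - start

# First call: with the triton backend this includes the one-time kernel
# compilation, so it is expected to be much slower.
warmup = timed_generate("How to prepare pasta?")

# Second call onwards: the compiled kernels are reused, so this reflects
# the actual steady-state inference speed.
steady = timed_generate("How to prepare pasta?")

print(f"warm-up: {warmup:.1f}s, steady-state: {steady:.1f}s")
```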

If one prefers a constant speed when using the 4-bit GPTQ models, then the cuda backend has to be used instead, but it is a bit slower.

The CUDA kernels also have to be installed separately; see the sketch below.
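
For reference, installing the kernels in GPTQ-style repos usually looks like the following; the exact command for falcontune is an assumption here, so verify it against the repository README:

```bash
# Assumed install step for the CUDA kernels (check the README):
git clone https://github.com/rmihaylov/falcontune
cd falcontune
python setup_cuda.py install
```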