rmihaylov / falcontune

Tune any FALCON in 4-bit


Inference speed for 7B models (triton backend, RTX 3090)

nikshepsvn opened this issue:

I'm running the 7B model on a 3090, and the inference time for the prompt "How to prepare pasta?" is around 20-30 seconds. Is this expected?


The fast inference applies to the 4-bit GPTQ models. Also, the triton code is compiled during the first generation, which is why the real speedup only shows up from the second generation onwards.
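
If you want to benchmark this yourself, here is a minimal timing sketch that separates the warm-up call from the steady-state calls. It uses the plain transformers API as a stand-in (the model name and loading arguments below are assumptions, not falcontune's actual loading path), but the measurement pattern is the same:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in checkpoint; in practice you would load the 4-bit GPTQ
# weights through falcontune as described in the README.
MODEL = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

def timed_generate(prompt: str, max_new_tokens: int = 50) -> float:
    """Run one generation and return the wall-clock time in seconds."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens)
    return time.perf_counter() - start

# First call: with the triton backend this includes the one-time kernel
# compilation, so it is expected to be much slower.
warmup = timed_generate("How to prepare pasta?")

# Second call onwards: the compiled kernels are reused, so this reflects
# the actual steady-state inference speed.
steady = timed_generate("How to prepare pasta?")

print(f"warm-up: {warmup:.1f}s, steady-state: {steady:.1f}s")
```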

If one prefers a constant speed when using the 4-bit GPTQ models, then the cuda backend has to be used instead, but it is a bit slower.

The CUDA kernels also have to be installed separately; see the sketch below.
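
For reference, installing the kernels in GPTQ-style repos usually looks like the following; the exact command for falcontune is an assumption here, so verify it against the repository README:

```bash
# Assumed install step for the CUDA kernels (check the README):
git clone https://github.com/rmihaylov/falcontune
cd falcontune
python setup_cuda.py install
```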