deepseek-ai / DeepGEMM

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Repository from GitHub https://github.com/deepseek-ai/DeepGEMM

Question: it seems that DeepSeek V3.1 is faster than DeepSeek V3 (0324 version) when decoding with SGLang on H800.

Huixxi opened this issue

commented

I don't know whether V3.1 is really faster than V3, but if it is, does that have something to do with its new UE8M0 format, and does DeepGEMM support this type of FP8 natively? Thanks.

On the same device, UE8M0 and FP32 scaling factors (SF) offer the same level of performance. My guess is that V3.1's outputs are shorter than V3's on average.
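
For readers unfamiliar with the two scale formats compared above, here is a minimal sketch of how they differ. An FP32 scaling factor can be any float. A UE8M0 scaling factor is unsigned with 8 exponent bits and 0 mantissa bits, so it can only be a power of two. The sketch assumes the E8M0 layout from the OCP Microscaling (MX) spec (bias 127, code 255 reserved for NaN) and assumes the scale is rounded up to the next power of two. The function names are hypothetical and are not DeepGEMM's API.

```python
import math

E8M0_BIAS = 127      # exponent bias, assumed from the OCP MX E8M0 definition
E8M0_MAX_CODE = 254  # code 255 is reserved for NaN in that definition


def encode_ue8m0(scale: float) -> int:
    """Round a positive scale up to a power of two and return its 8-bit code."""
    if not (math.isfinite(scale) and scale > 0.0):
        raise ValueError("scale must be positive and finite")
    # frexp gives scale = mantissa * 2**exp with 0.5 <= mantissa < 1.
    mantissa, exp = math.frexp(scale)
    # If scale is already a power of two, keep it; otherwise round up.
    exponent = exp - 1 if mantissa == 0.5 else exp
    # Clamp into the representable code range.
    return min(max(exponent + E8M0_BIAS, 0), E8M0_MAX_CODE)


def decode_ue8m0(code: int) -> float:
    """Turn an 8-bit UE8M0 code back into its power-of-two scale."""
    if not 0 <= code <= E8M0_MAX_CODE:
        raise ValueError("code must be in 0..254")
    return 2.0 ** (code - E8M0_BIAS)


if __name__ == "__main__":
    for fp32_scale in (0.3, 1.0, 3.0):
        code = encode_ue8m0(fp32_scale)
        print(fp32_scale, "->", code, "->", decode_ue8m0(code))
```

Either way each scale group carries a single scaling factor, which fits the reply's point that the format alone is not expected to change decoding speed on the same device.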

commented

Ok, thanks a lot.