intel / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, DeepSeek, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, DeepSpeed, Axolotl, etc.

Repository from GitHub: https://github.com/intel/ipex-llm

TTFT of the distilled Qwen model is worse than the base model's. Is this expected behavior?

dan20210809 opened this issue

Hi, ipex team

I tested TTFT and TPOT for these two pairs of models on a Core Ultra 5 125H:

DeepSeek-R1-Distill-Qwen-1.5B vs Qwen2.5-1.5B
DeepSeek-R1-Distill-Qwen-7B vs Qwen2.5-7B

How to test: https://github.com/intel/ipex-llm/tree/main/python/llm/dev/benchmark/all-in-one
Env: Win11 + GPU driver 32.0.101.6632 + ipex-llm 2.2.0b20250228
Config:
input-output: 1024-512
test_api: "transformer_int4_fp16_gpu_win"

| Name | 1st token avg latency (ms) | 2+ avg latency (ms/token) |
|------|---------------------------:|--------------------------:|
| DeepSeek-R1-Distill-Qwen-1.5B | 622.92 | 20.84 |
| Qwen2.5-1.5B | 514.1 | 20.44 |
| DeepSeek-R1-Distill-Qwen-7B | 2799.25 | 54.27 |
| Qwen2.5-7B | 2212.94 | 53.39 |

To my knowledge, the performance of the distilled model should be the same as the base model's, but the data shows a difference. I am not sure whether the test is valid. Do you know the root cause?

Hi @dan20210809

The distilled models are based on Qwen2.5-Math-XX. They are similar to Qwen2.5, but with a few differences in config.json. For example, here is the config.json diff between DeepSeek-R1-Distill-Qwen-1.5B and Qwen2.5-1.5B; these config differences lead to the latency differences.

```
13c13
<   "max_window_layers": 21,
---
>   "max_window_layers": 28,
19,21c19,21
<   "rope_theta": 10000,
<   "sliding_window": 4096,
<   "tie_word_embeddings": false,
---
>   "rope_theta": 1000000.0,
>   "sliding_window": 131072,
>   "tie_word_embeddings": true,
23c23
<   "transformers_version": "4.44.0",
---
>   "transformers_version": "4.40.1",
```
| Model | Base Model | Download |
|-------|------------|----------|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 🤗 HuggingFace |

@qiyuangong, I retested the models you mentioned, but the gap still exists:

| Name | 1st token avg latency (ms) | 2+ avg latency (ms/token) |
|------|---------------------------:|--------------------------:|
| DeepSeek-R1-Distill-Qwen-1.5B | 591.69 | 32.3 |
| Qwen2.5-Math-1.5B | 489.53 | 31 |
| DeepSeek-R1-Distill-Qwen-7B | 2639.21 | 50.79 |
| Qwen2.5-Math-7B | 2170.82 | 52 |

It seems that the distilled model has worse TTFT than the base model. Do you know the root cause?

The same reason applies here: DeepSeek-R1-Distill-Qwen-1.5B and Qwen2.5-Math-1.5B also use different configurations.

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/config.json
https://huggingface.co/Qwen/Qwen2.5-Math-1.5B/blob/main/config.json

Please check the config differences before benchmarking, e.g., config.json, tokenizer_config.json, generation_config.json, etc.
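
As a quick way to inspect those differences programmatically, here is a minimal sketch that downloads both config.json files and prints the fields that differ. It assumes the huggingface_hub package is installed; the repo IDs are the ones linked above:

```python
# Diff the config.json of two Hugging Face model repos, field by field.
import json
from huggingface_hub import hf_hub_download

def load_config(repo_id: str) -> dict:
    path = hf_hub_download(repo_id=repo_id, filename="config.json")
    with open(path) as f:
        return json.load(f)

a = load_config("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
b = load_config("Qwen/Qwen2.5-Math-1.5B")

# Print every key whose value differs (or that exists in only one config).
for key in sorted(set(a) | set(b)):
    if a.get(key) != b.get(key):
        print(f"{key}: {a.get(key)!r} vs {b.get(key)!r}")
```

The same check can be repeated for tokenizer_config.json and generation_config.json by changing the filename argument.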