intel / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, DeepSeek, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, DeepSpeed, Axolotl, etc.

Repository from GitHub: https://github.com/intel/ipex-llm

TTFT of the distilled Qwen model is worse than the base model's. Is this expected behavior?

dan20210809 opened this issue

Hi, ipex team

I tested TTFT and TPOT for these two pairs of models on a Core Ultra 5 125H:

DeepSeek-R1-Distill-Qwen-1.5B vs Qwen2.5-1.5B
DeepSeek-R1-Distill-Qwen-7B vs Qwen2.5-7B

How to test: https://github.com/intel/ipex-llm/tree/main/python/llm/dev/benchmark/all-in-one
Env: Win11 + GPU driver 32.0.101.6632 + ipex-llm 2.2.0b20250228
Config:
input-output: 1024-512
test_api: "transformer_int4_fp16_gpu_win"

| Name | 1st token avg latency (ms) | 2+ avg latency (ms/token) |
|------|---------------------------:|--------------------------:|
| DeepSeek-R1-Distill-Qwen-1.5B | 622.92 | 20.84 |
| Qwen2.5-1.5B | 514.1 | 20.44 |
| DeepSeek-R1-Distill-Qwen-7B | 2799.25 | 54.27 |
| Qwen2.5-7B | 2212.94 | 53.39 |

To my knowledge, the performance of the distilled model should be the same as the base model's, but the data shows a difference. I am not sure whether the test is valid. Do you know the root cause?

Hi @dan20210809

The distilled models are based on Qwen2.5-Math-XX. They are similar to Qwen2.5, but with a few differences in config.json. For example, here is the config.json diff between DeepSeek-R1-Distill-Qwen-1.5B and Qwen2.5-1.5B; these config differences lead to the latency differences.

```
13c13
<   "max_window_layers": 21,
---
>   "max_window_layers": 28,
19,21c19,21
<   "rope_theta": 10000,
<   "sliding_window": 4096,
<   "tie_word_embeddings": false,
---
>   "rope_theta": 1000000.0,
>   "sliding_window": 131072,
>   "tie_word_embeddings": true,
23c23
<   "transformers_version": "4.44.0",
---
>   "transformers_version": "4.40.1",
```
| Model | Base Model | Download |
|-------|------------|----------|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 🤗 HuggingFace |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 🤗 HuggingFace |

@qiyuangong, I retested the models you mentioned, but the gap still exists:

| Name | 1st token avg latency (ms) | 2+ avg latency (ms/token) |
|------|---------------------------:|--------------------------:|
| DeepSeek-R1-Distill-Qwen-1.5B | 591.69 | 32.3 |
| Qwen2.5-Math-1.5B | 489.53 | 31 |
| DeepSeek-R1-Distill-Qwen-7B | 2639.21 | 50.79 |
| Qwen2.5-Math-7B | 2170.82 | 52 |

It seems that the distilled model has worse TTFT than the base model. Do you know the root cause?

The same reason applies here: DeepSeek-R1-Distill-Qwen-1.5B and Qwen2.5-Math-1.5B also use different configurations.

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/config.json
https://huggingface.co/Qwen/Qwen2.5-Math-1.5B/blob/main/config.json

Please check the config differences before benchmarking, e.g., config.json, tokenizer_config.json, generation_config.json, etc.
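
As a quick way to inspect those differences programmatically, here is a minimal sketch that downloads both config.json files and prints the fields that differ. It assumes the huggingface_hub package is installed; the repo IDs are the ones linked above:

```python
# Diff the config.json of two Hugging Face model repos, field by field.
import json
from huggingface_hub import hf_hub_download

def load_config(repo_id: str) -> dict:
    path = hf_hub_download(repo_id=repo_id, filename="config.json")
    with open(path) as f:
        return json.load(f)

a = load_config("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
b = load_config("Qwen/Qwen2.5-Math-1.5B")

# Print every key whose value differs (or that exists in only one config).
for key in sorted(set(a) | set(b)):
    if a.get(key) != b.get(key):
        print(f"{key}: {a.get(key)!r} vs {b.get(key)!r}")
```

The same check can be repeated for tokenizer_config.json and generation_config.json by changing the filename argument.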