vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Home Page: https://docs.vllm.ai

[Usage]: Seems nn.module definition may affect the output tokens. Don't know the reason.

Zhenzhong1 opened this issue · comments

Your current environment

Env: CPU device
vllm: 0.4.2+cpu

from vllm import LLM
import torch

prompts = ["你好"]
llm1 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)
outputs1 = llm1.generate(prompts)  # Generate texts from the prompts.

llm2 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)
outputs2 = llm2.generate(prompts)  # Generate texts from the prompts.

llm3 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
outputs3 = llm3.generate(prompts)  # Generate texts from the prompts.

print("outputs1 = ", outputs1)
print("outputs2 = ", outputs2)
print("outputs3 = ", outputs3)

For this code, as long as I construct a torch.nn.Module after the vLLM model has been created, it changes the output tokens even though I never use it. In other words, if I move these unused nn.Modules above the LLM() definition, the results are not affected.
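
For clarity, this is the reordering I mean (just a minimal sketch with the same local model path as above; the unused Linear layers are constructed before any LLM() exists, and with this ordering the outputs stay the same):

from vllm import LLM
import torch

prompts = ["你好"]

# Construct the (unused) Linear layers *before* any LLM is created.
torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)
torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)

llm1 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)
outputs1 = llm1.generate(prompts)

llm2 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)
outputs2 = llm2.generate(prompts)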

llm1 gives the same output as llm2, because both create an nn.Module while the model is live. llm3 is different because I don't create anything extra, and llm3 produces the correct result I want.

Shouldn't all three of them give the same result? Please check the screenshot or the text below.

Output screenshots:

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.22s/it]
outputs1 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',我是小助手 本AI 欢迎你随时向我提问,我会尽力回答', token_ids=[31123, 33030, 54603, 42481, 35786, 23833, 30910, 32616, 54622, 34498, 46993, 37817, 31123, 35094, 40328, 33287], cumulative_logprob=-17.481587450020015, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715665805.6874118, last_token_time=1715665805.6874118, first_scheduled_time=1715665805.689108, first_token_time=1715665805.8463485, time_in_queue=0.0016961097717285156, finished_time=1715665806.759257), lora_request=None)]
outputs2 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',我是小助手 本AI 欢迎你随时向我提问,我会尽力回答', token_ids=[31123, 33030, 54603, 42481, 35786, 23833, 30910, 32616, 54622, 34498, 46993, 37817, 31123, 35094, 40328, 33287], cumulative_logprob=-17.481587450020015, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715665811.4080832, last_token_time=1715665811.4080832, first_scheduled_time=1715665811.4091282, first_token_time=1715665811.539016, time_in_queue=0.0010449886322021484, finished_time=1715665812.7462144), lora_request=None)]
outputs3 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',我是 ChatGLM2-6B, 我是基于大型语言模型', token_ids=[31123, 33030, 22011, 10461, 30944, 30943, 30941, 30978, 30949, 31123, 30910, 33030, 33053, 32997, 32330, 34030], cumulative_logprob=-8.741462323308497, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715665822.238591, last_token_time=1715665822.238591, first_scheduled_time=1715665822.2395456, first_token_time=1715665822.5107977, time_in_queue=0.0009546279907226562, finished_time=1715665823.461715), lora_request=None)]

Besides, if I change the out_features of the torch.nn.Linear, it also affects the output tokens.

prompts = ["你好"]
llm1 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=8888, bias=True, dtype=torch.bfloat16)
outputs1 = llm1.generate(prompts)  # Generate texts from the prompts.
print(outputs1)

llm2 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=9999, bias=True, dtype=torch.bfloat16)
outputs2 = llm2.generate(prompts)

I only change the output_features, but results are different.
outputs:

outputs1 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',是一名人工智能助手。 \n\n如果你需要帮助,请告诉我具体问题', token_ids=[31123, 38628, 34797, 42481, 31155, 30910, 13, 13, 32763, 31665, 31934, 30932, 55073, 38953, 32149, 31639], cumulative_logprob=-21.3015581928193, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715666711.2086165, last_token_time=1715666711.2086165, first_scheduled_time=1715666711.2102835, first_token_time=1715666711.3079636, time_in_queue=0.001667022705078125, finished_time=1715666712.208443), lora_request=None)]
outputs2 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',小河流段便会非常活跃。很多体载货物的鱼类 difficult,', token_ids=[31123, 54603, 36773, 55005, 42237, 31685, 35203, 31155, 31679, 54618, 55387, 55466, 34090, 49426, 2529, 30932], cumulative_logprob=-96.62851423444226, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715666716.799589, last_token_time=1715666716.799589, first_scheduled_time=1715666716.8003457, first_token_time=1715666716.8765712, time_in_queue=0.0007567405700683594, finished_time=1715666718.0433056), lora_request=None)]

As you can see, I never actually use these nn.Modules, yet they do affect the results. I have provided five outputs and they are all different; the only thing that changes is the nn.Module construction.

Need some help. Thank you!

How would you like to use vllm

It seems that defining an nn.Module may affect the output tokens; I don't know the reason.

This is quite interesting. Can you double-check by setting a seed?

If this is real, I suspect it has something to do with a memory leak and the PyTorch caching allocator. Maybe we leaked some object reference, and when you create a new nn.Module, the caching allocator recycles memory that it thinks is no longer used but that is actually still used somewhere?

I might be wrong, of course. If that is the case, the root cause would be quite difficult to debug.
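
In addition to fixing the seed, one thing that might help rule out sampler randomness (a rough sketch, assuming the same local model path as above, not something I have run on this setup) is to force greedy decoding with SamplingParams(temperature=0). If the outputs still differ between runs, the divergence is in the computed logits rather than in the sampling step:

from vllm import LLM, SamplingParams
import torch

prompts = ["你好"]
greedy = SamplingParams(temperature=0)  # temperature=0 means greedy decoding, no sampling randomness

llm1 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=0)
torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)  # extra, unused module
outputs1 = llm1.generate(prompts, greedy)

llm2 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=0)
outputs2 = llm2.generate(prompts, greedy)

print("outputs1 = ", outputs1)
print("outputs2 = ", outputs2)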

@simon-mo Hi

from vllm import LLM
import torch

prompts = ["你好"]
llm1 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=666)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=8888, bias=True, dtype=torch.bfloat16)
outputs1 = llm1.generate(prompts)  # Generate texts from the prompts.
print(outputs1)

llm2 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=666)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=9999, bias=True, dtype=torch.bfloat16)
outputs2 = llm2.generate(prompts)  # Generate texts from the prompts.

llm3 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=666)  # Create an LLM.
outputs3 = llm3.generate(prompts)  # Generate texts from the prompts.

print("outputs1 = ", outputs1)
print("outputs2 = ", outputs2)
print("outputs3 = ", outputs3)

I set the same seed for all three, but I still get three different outputs. Actually, LLM() already has a default seed (seed: int = 0).

outputs1 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=', p更 爱 你 要 是 你 要 是 你 要 是 你 要 是', token_ids=[31123, 281, 54664, 47802, 36474, 43159, 35369, 36474, 43159, 35369, 36474, 43159, 35369, 36474, 43159, 35369], cumulative_logprob=-41.74734868388623, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715824550.3473322, last_token_time=1715824550.3473322, first_scheduled_time=1715824550.3491716, first_token_time=1715824555.3297749, time_in_queue=0.0018393993377685547, finished_time=1715824620.9681613), lora_request=None)]
outputs2 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='老师和同学们,今天我带了人民调解委员会调解费收据 我不知道', token_ids=[42116, 32812, 31123, 31869, 54546, 54882, 54537, 31657, 36122, 32007, 36122, 55000, 54821, 54830, 34211, 32522], cumulative_logprob=-43.803544878959656, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715824629.7847252, last_token_time=1715824629.7847252, first_scheduled_time=1715824629.7856104, first_token_time=1715824633.9895625, time_in_queue=0.0008852481842041016, finished_time=1715824653.5920393), lora_request=None)]
outputs3 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',我是人工智能助手。 根据用户名登录后,我的作用是提供咨询', token_ids=[31123, 33030, 34797, 42481, 31155, 47383, 32053, 54653, 36782, 54585, 31123, 31791, 31827, 54532, 31692, 32539], cumulative_logprob=-32.18759796023369, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715824663.3346176, last_token_time=1715824663.3346176, first_scheduled_time=1715824663.3352196, first_token_time=1715824663.549846, time_in_queue=0.0006020069122314453, finished_time=1715824664.6953938), lora_request=None)]
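
To probe the caching-allocator hypothesis further, a possible follow-up experiment (just a sketch along those lines, not something I have verified) would be to hold a reference to the extra Linear, then delete it and force garbage collection before calling generate() again, and check whether either variant changes the output:

import gc
import torch
from vllm import LLM

prompts = ["你好"]
llm = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=0)

# Variant A: keep a live reference to the extra module while generating.
extra = torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)
outputs_a = llm.generate(prompts)

# Variant B: drop the module and force collection before generating again.
del extra
gc.collect()
outputs_b = llm.generate(prompts)

print("outputs_a = ", outputs_a)
print("outputs_b = ", outputs_b)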