THUDM / ChatGLM-6B

ChatGLM-6B: An Open Bilingual Dialogue Language Model | 开源双语对话语言模型


[BUG/Help] First-token latency is affected by input length, growing significantly and roughly linearly

woaipichuli opened this issue

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

When verified on GPU, first-token latency grows noticeably with input length, roughly linearly: going from 512 to 2048 input tokens, the first-token latency increases almost 4x, from about 500 ms to 1.8 s.
The input (prompt) portion should be computed in parallel, so why does the latency grow this much?

Expected Behavior

No response

Steps To Reproduce

import torch
from transformers import AutoModel, AutoTokenizer, LogitsProcessorList
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained(base_model_name_or_path, trust_remote_code=True)
# `revision` expects a branch/tag/commit string, not True; omitted here to use the default
base_model = AutoModel.from_pretrained(base_model_name_or_path, trust_remote_code=True)
model = PeftModel.from_pretrained(base_model, peft_model_id, torch_dtype=torch.float16)
model = model.cuda().eval()

text = "测试文本"  # sample input; renamed to avoid shadowing the built-in `str`
pt_data = tokenizer(text, return_tensors="pt", padding=True).to("cuda")
gen_kwargs = {
    "max_length": pt_data["input_ids"].shape[-1] + 1,  # generate exactly one new token, isolating first-token latency
    "num_beams": 1,
    "do_sample": False,  # greedy decoding; top_p and temperature are then ignored
    "logits_processor": LogitsProcessorList(),  # was undefined in the original snippet
}
outputs = model.generate(**pt_data, **gen_kwargs)
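
For reference, a minimal timing harness along these lines (not from the original report; the dummy-prompt construction and the `max_length = n + 1` trick to stop after one generated token are assumptions) can show how first-token latency scales with input length:

import time
import torch

# Hypothetical benchmark: build a dummy prompt of each target length and time a
# single-new-token generate() call. torch.cuda.synchronize() is needed for
# accurate timing because CUDA kernels launch asynchronously.
for n in (512, 1024, 2048):
    input_ids = torch.randint(5, 1000, (1, n), device="cuda")  # arbitrary token ids, for timing only
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(input_ids=input_ids, max_length=n + 1, num_beams=1, do_sample=False)
    torch.cuda.synchronize()
    print(f"input length {n}: {time.perf_counter() - start:.3f} s to first token")

If the prefill cost were independent of prompt length, these times would stay roughly flat; the near-4x jump from 512 to 2048 tokens reported above instead matches total work growing with the number of input tokens.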

Environment

Reproduced on two GPUs: V100 and T4.

Anything else?

No response