octoml / mlc-llm

Enable everyone to develop, optimize and deploy AI models natively on everyone's devices.

Home Page: https://mlc.ai/mlc-llm


[Bug] Fix broken test for cache eviction with staging engine

masahi opened this issue

@elvin-n I understand the problem you encountered in #158 and your proposed solution 1, but I don't understand what you meant in solution 2: "Do decode inferencing in the worker, but do not return it back to the SequenceGenerationOutput until we achieve number of already generated tokens". Can you elaborate?

I do agree that the clamping we are doing in the worker is buggy, and we don't want to do the same thing in the main process either.

Oh, maybe what you meant was: "Replace the clamping in the worker with len(generated_token_ids) steps of decoding, to recover the state before eviction". We now have a better solution using evaluate_multi_query, which can do the same thing as multiple steps of decoding.
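To make the comparison concrete, here is a minimal sketch of the two restoration paths being discussed. It is not the actual mlc-llm serving API: `KVCache`, `decode_step`, and `evaluate_multi_query` below are hypothetical stand-ins that only illustrate the control flow, i.e. that one multi-query evaluation over the already generated tokens rebuilds the same cache state as replaying one decode step per token after eviction.

```python
# Hypothetical sketch, not the mlc-llm implementation.
from dataclasses import dataclass, field
from typing import List


@dataclass
class KVCache:
    """Toy cache: just records which token ids have been attended to."""
    token_ids: List[int] = field(default_factory=list)


def decode_step(cache: KVCache, token_id: int) -> None:
    # One decode step extends the cache by a single token.
    cache.token_ids.append(token_id)


def evaluate_multi_query(cache: KVCache, token_ids: List[int]) -> None:
    # A multi-query evaluation extends the cache by many tokens in one call,
    # equivalent in effect to running decode_step once per token.
    cache.token_ids.extend(token_ids)


def restore_by_decoding(prompt: List[int], generated: List[int]) -> KVCache:
    # Replay len(generated) decode steps in the worker until the cache
    # again holds all previously generated tokens.
    cache = KVCache(token_ids=list(prompt))
    for tok in generated:
        decode_step(cache, tok)
    return cache


def restore_by_multi_query(prompt: List[int], generated: List[int]) -> KVCache:
    # One multi-query evaluation recovers the same cache state in a single call.
    cache = KVCache(token_ids=list(prompt))
    evaluate_multi_query(cache, generated)
    return cache


if __name__ == "__main__":
    prompt = [1, 2, 3]
    generated = [10, 11, 12, 13]
    assert (restore_by_decoding(prompt, generated).token_ids
            == restore_by_multi_query(prompt, generated).token_ids)
    print("both restoration paths yield the same cache state")
```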

Consolidating this into #163 since the underlying issue is the same.