octoml / mlc-llm

Enable everyone to develop, optimize and deploy AI models natively on everyone's devices.

Home Page: https://mlc.ai/mlc-llm


[Bug] Fix broken test for cache eviction with staging engine

masahi opened this issue

@elvin-n I understand the problem you encountered in #158 and your proposed solution 1, but I don't understand what you meant in solution 2: "Do decode inferencing in the worker, but do not return it back to the SequenceGenerationOutput until we achieve number of already generated tokens". Can you elaborate?

I do agree that the clamping we are doing in the worker is buggy, and we don't want to do the same thing in the main process either.

Oh, maybe what you meant was: "Replace the clamping in the worker with len(generated_token_ids) steps of decoding, to recover the state before eviction". We now have a better solution using evaluate_multi_query, which can do the same thing as multiple steps of decoding.
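To make the comparison concrete, here is a minimal sketch of the two restoration paths being discussed. It is not the actual mlc-llm serving API: `KVCache`, `decode_step`, and `evaluate_multi_query` below are hypothetical stand-ins that only illustrate the control flow, i.e. that one multi-query evaluation over the already generated tokens rebuilds the same cache state as replaying one decode step per token after eviction.

```python
# Hypothetical sketch, not the mlc-llm implementation.
from dataclasses import dataclass, field
from typing import List


@dataclass
class KVCache:
    """Toy cache: just records which token ids have been attended to."""
    token_ids: List[int] = field(default_factory=list)


def decode_step(cache: KVCache, token_id: int) -> None:
    # One decode step extends the cache by a single token.
    cache.token_ids.append(token_id)


def evaluate_multi_query(cache: KVCache, token_ids: List[int]) -> None:
    # A multi-query evaluation extends the cache by many tokens in one call,
    # equivalent in effect to running decode_step once per token.
    cache.token_ids.extend(token_ids)


def restore_by_decoding(prompt: List[int], generated: List[int]) -> KVCache:
    # Replay len(generated) decode steps in the worker until the cache
    # again holds all previously generated tokens.
    cache = KVCache(token_ids=list(prompt))
    for tok in generated:
        decode_step(cache, tok)
    return cache


def restore_by_multi_query(prompt: List[int], generated: List[int]) -> KVCache:
    # One multi-query evaluation recovers the same cache state in a single call.
    cache = KVCache(token_ids=list(prompt))
    evaluate_multi_query(cache, generated)
    return cache


if __name__ == "__main__":
    prompt = [1, 2, 3]
    generated = [10, 11, 12, 13]
    assert (restore_by_decoding(prompt, generated).token_ids
            == restore_by_multi_query(prompt, generated).token_ids)
    print("both restoration paths yield the same cache state")
```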

Consolidating this into #163 since the underlying issue is the same.