[Bug] Fix broken test for cache eviction with staging engine
masahi opened this issue · comments
@elvin-n I understand the problem you encountered in #158, and the possible solution 1. But I don't understand what you meant in 2., "Do decode inferencing in the worker, but do not return it back to the SequenceGenerationOutput until we achieve number of already generated tokens". Can you elaborate?
I do agree that the clamping we are doing in the worker is buggy and we don't want to do the same thing in the main process as well.
Oh, maybe what you meant was: "Replace the clamping in the worker with `len(generated_token_ids)` steps of decoding, to recover the state before eviction". We now have a better solution using `evaluate_multi_query`, which can do the same thing as multiple steps of decoding.
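
To make the equivalence concrete, here is a toy sketch, not the actual engine code. It assumes a cache that holds one entry per processed token, and every name in it (`ToyCache`, `toy_state`, `prefill`, `decode_step`, `multi_query_eval`) is hypothetical. In particular, `multi_query_eval` is only a stand-in for the idea behind `evaluate_multi_query`, not its real signature. The point is just that replaying the generated tokens one decode step at a time and processing them all in one multi-token pass should leave the cache in the same state.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyCache:
    # One entry per token whose state has been computed (hypothetical stand-in for a KV cache).
    entries: List[int] = field(default_factory=list)


def toy_state(token_id: int, position: int) -> int:
    # Stand-in for the per-token computation; depends on the token and its position.
    return token_id * 31 + position


def prefill(cache: ToyCache, prompt_token_ids: List[int]) -> None:
    # Process the whole prompt, appending one cache entry per prompt token.
    for tok in prompt_token_ids:
        cache.entries.append(toy_state(tok, len(cache.entries)))


def decode_step(cache: ToyCache, token_id: int) -> None:
    # A single decode step: extend the cache by exactly one token.
    cache.entries.append(toy_state(token_id, len(cache.entries)))


def multi_query_eval(cache: ToyCache, token_ids: List[int]) -> None:
    # Process several new tokens in one pass on top of the existing cache.
    start = len(cache.entries)
    cache.entries.extend(toy_state(tok, start + i) for i, tok in enumerate(token_ids))


def restore_by_decode_steps(prompt: List[int], generated: List[int]) -> ToyCache:
    # Recover the pre-eviction state with len(generated) decode steps.
    cache = ToyCache()
    prefill(cache, prompt)
    for tok in generated:
        decode_step(cache, tok)
    return cache


def restore_by_multi_query(prompt: List[int], generated: List[int]) -> ToyCache:
    # Recover the same state with one multi-token pass over the generated tokens.
    cache = ToyCache()
    prefill(cache, prompt)
    multi_query_eval(cache, generated)
    return cache


prompt_token_ids = [101, 7, 42]
generated_token_ids = [5, 9, 13, 2]

stepwise = restore_by_decode_steps(prompt_token_ids, generated_token_ids)
batched = restore_by_multi_query(prompt_token_ids, generated_token_ids)
assert stepwise.entries == batched.entries
```

In both paths the restored cache covers the full prompt plus every already generated token, so no clamping is needed.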