Speculative decoding support with Eager streaming mode

Question

atiorh opened this issue 3 months ago

Eager streaming mode implies that we predict the same token at least twice. This redundancy is a great opportunity to design a speculative decoding technique that uses a fast draft model* to amortize the repeated predictions while accelerating the overall pipeline.

* Draft: distil-large-v3, Oracle: large-v3. They share AudioEncoders; only the TextDecoders differ.
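To make the proposal concrete, here is a minimal sketch of greedy speculative decoding. It is not this project's API. `speculative_decode`, `toy_draft`, and `toy_oracle` are hypothetical names, and the two toy callables only stand in for the distil-large-v3 and large-v3 TextDecoders. The draft proposes up to `k` tokens, the oracle checks them, and the matching prefix is kept along with the oracle's own token at the first mismatch, so the output should match oracle-only greedy decoding.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical interface: maps a token prefix to the greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_decode(
    draft_next: NextToken,
    oracle_next: NextToken,
    prefix: Sequence[int],
    max_new_tokens: int,
    k: int = 4,
    eos: Optional[int] = None,
) -> List[int]:
    """Greedy speculative decoding with a draft and an oracle model."""
    tokens = list(prefix)
    limit = len(tokens) + max_new_tokens
    while len(tokens) < limit:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        budget = min(k, limit - len(tokens))
        proposal: List[int] = []
        for _ in range(budget):
            proposal.append(draft_next(tokens + proposal))

        # 2. The oracle verifies the proposal position by position.
        #    Assumption: a real decoder would score all positions in one
        #    batched forward pass; this sketch calls it sequentially.
        accepted: List[int] = []
        for tok in proposal:
            expected = oracle_next(tokens + accepted)
            accepted.append(expected)  # always keep the oracle's token
            if expected != tok:
                break  # first mismatch: discard the rest of the proposal

        tokens.extend(accepted)

        # 3. Stop at the first end-of-sequence token, if one is configured.
        if eos is not None and eos in accepted:
            cut = len(tokens) - len(accepted) + accepted.index(eos) + 1
            return tokens[:cut]
    return tokens


# Toy stand-ins (assumptions, not real models): the draft agrees with the
# oracle except when the prefix length is a multiple of 3.
def toy_oracle(seq: Sequence[int]) -> int:
    return (seq[-1] * 3 + 1) % 11


def toy_draft(seq: Sequence[int]) -> int:
    guess = toy_oracle(seq)
    return guess if len(seq) % 3 else (guess + 1) % 11


out = speculative_decode(toy_draft, toy_oracle, [1], max_new_tokens=8, k=4)
print(out)
```

Since the draft and oracle share the AudioEncoder, the audio features would only need to be computed once per window and could be reused by both decoders. The speedup then depends on how often the draft's tokens are accepted.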