Speculative decoding support with Eager streaming mode
atiorh opened this issue · comments
Atila Orhon commented
The Eager streaming mode implies that we predict the same token at least twice. This is a great opportunity to design a speculative decoding technique that can leverage a fast draft model* and amortize the redundant predictions while accelerating the overall pipeline.
- Draft: distil-large-v3, Oracle: large-v3. They share AudioEncoders, only TextDecoders are different