lucidrains / speculative-decoding

Explorations into some recent techniques surrounding speculative decoding


Draft & Verify

Ryu1845 opened this issue · comments

Does this repository implement Draft & Verify?

@Ryu1845 hey! thanks for sharing that paper!

that looks quite close to, if not better than, the naive early exit strategy (they predict which layers to skip through some heuristic) - but using the same model for speculating / drafting is definitely what i was going for.
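for concreteness, here's a minimal greedy sketch of that loop - draft with a truncated forward pass through the same network, then verify the whole draft with one full-depth pass and keep the longest agreeing prefix. the `exit_layer` kwarg is a hypothetical early-exit hook and greedy agreement stands in for the proper modified rejection sampling, so don't read this as the repo's actual code:

```python
import torch

@torch.no_grad()
def self_speculative_step(model, seq, exit_layer = 6, num_draft = 4):
    # draft: greedily decode a few tokens using only the first `exit_layer` layers
    draft = seq
    for _ in range(num_draft):
        logits = model(draft, exit_layer = exit_layer)  # hypothetical early-exit hook
        draft = torch.cat((draft, logits[:, -1].argmax(dim = -1, keepdim = True)), dim = -1)

    # verify: one full-depth forward pass over prompt + drafted tokens
    full_logits = model(draft)
    preds = full_logits[:, seq.shape[-1] - 1:-1].argmax(dim = -1)
    drafted = draft[:, seq.shape[-1]:]

    # accept the longest prefix on which the full model agrees with the draft
    # (taking the batch minimum keeps the output rectangular - a simplification)
    agree = (preds == drafted).long().cumprod(dim = -1)
    num_accepted = int(agree.sum(dim = -1).min())
    return draft[:, :seq.shape[-1] + num_accepted]
```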

i think my prophet transformer idea should be the best though (although i'm biased and still haven't run any head to head 😆)

@Ryu1845 really think we are going to see a resurgence in adaptive computation research over the next year, like actually made practical

I think so too, thanks again for your work.
it looks like the official code for the paper will be uploaded here, but I'll keep an eye on this repo too 😉

@Ryu1845 sounds good!

yea i think the main idea of the prophet approach is to take advantage of the cached last layer embedding from the large model, which should be superior to any early exit stuff. if you find me another paper that did that, i'd definitely read and implement it

i'm also using a transformer on top, borrowing working ideas from the hierarchical transformer line of research
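roughly what i mean, as an illustrative sketch (placeholder names and hyperparameters, not this repo's actual module) - a tiny causal transformer reading the large model's cached final-layer embeddings:

```python
import torch
from torch import nn

class ProphetHead(nn.Module):
    def __init__(self, dim, num_tokens, depth = 2, heads = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim_feedforward = dim * 4, batch_first = True)
        self.prophet = nn.TransformerEncoder(layer, depth)  # the small transformer on top
        self.to_logits = nn.Linear(dim, num_tokens)

    def forward(self, embeds):
        # embeds: (batch, seq, dim) - cached final-layer embeddings from the large model,
        # so the drafter conditions on the full-depth representation of the prefix,
        # unlike early exit which only sees shallow hidden states
        n = embeds.shape[1]
        causal_mask = torch.full((n, n), float('-inf'), device = embeds.device).triu(1)
        hidden = self.prophet(embeds, mask = causal_mask)
        return self.to_logits(hidden)  # logits for the speculated next tokens
```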

> yea i think the main idea of the prophet approach is to take advantage of the cached last layer embedding from the large model, which should be superior to any early exit stuff.

I don't know of any paper that does this, but the Medusa project aims to do just that, I think.
https://together.ai/blog/medusa
https://github.com/FasterDecoding/Medusa

@Ryu1845 ohh yes, they totally did. so the only difference is i use a small transformer as the medusa / prophet heads
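to make the contrast concrete, medusa's drafter is (roughly) a set of independent mlp heads, one per speculated position ahead, all reading the same last hidden state - the sketch below is from my reading of their repo, so treat the details as approximate:

```python
import torch
from torch import nn

class MedusaStyleHeads(nn.Module):
    def __init__(self, dim, num_tokens, num_heads = 4):
        super().__init__()
        # one feedforward head per future position, no attention between them -
        # versus a small transformer, which lets the speculated positions interact
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, num_tokens))
            for _ in range(num_heads)
        ])

    def forward(self, hidden):
        # hidden: (batch, dim) - last hidden state at the current position
        return torch.stack([head(hidden) for head in self.heads], dim = 1)  # (batch, num_heads, num_tokens)
```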

ok let me cite them as well

@Ryu1845 oh haha, they don't have a paper, just a github repo. may be the new trend

I'm guessing they'll release a paper once they've got a working prototype 😄
It looks like it's still a WIP FasterDecoding/Medusa#3
I actually don't know if it's running yet :/

@Ryu1845 ohh, so it isn't functional yet? maybe i'll send their group a message. solving batched spec decoding is a bit tricky with the kv cache, but i found a solution (not sure if it's optimal)
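to sketch the wrinkle: after verification, each sequence in the batch accepts a different number of drafted tokens, so the kv caches become ragged. one option, illustrated below (not necessarily what i ended up with here), is to keep the caches padded to a common length and mask out the rejected positions for subsequent attention:

```python
import torch

def kv_cache_validity_mask(cache_len, num_drafted, num_accepted):
    # cache_len: padded kv cache length shared by the whole batch
    # num_accepted: (batch,) tokens each sequence kept after verification
    # valid length per sequence once its rejected draft tokens are discarded
    lens = cache_len - (num_drafted - num_accepted)
    pos = torch.arange(cache_len, device = num_accepted.device)
    # (batch, cache_len) boolean mask - True where the cache entry is still valid;
    # feed this into attention so stale (rejected) keys / values are ignored
    return pos[None, :] < lens[:, None]
```

new tokens then have to be written at each sequence's own valid-length offset rather than at one shared index, which is where most of the bookkeeping lives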

> ~~I'm guessing they'll release a paper once they've got a working prototype 😄~~ It looks like it's still a WIP FasterDecoding/Medusa#3 I actually don't know if it's running yet :/

so it works or doesn't work?

it looks like it works, I'm sorry for the misunderstanding on my side

nice! that's amazing, i believe in that approach

@lucidrains
Amazing work!
Do you plan to release your results with early exit?
Thanks