Medusa Speculative Decoding
someone13574 opened this issue · comments
Recently a project called Medusa was released. It trains additional `lm_head`s that, instead of predicting the next token, predict tokens n+2, n+3, and n+4. It then generates a tree of combinations of the top-k candidates for those upcoming tokens, evaluates all branches at once with some clever attention masking, and accepts the longest verified one. They report a ~2x speedup, and it looks like they are planning to integrate it into llama.cpp, so I thought it would be a good fit for this project as well.
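To make the idea concrete, here is a toy sketch of the verify-a-tree-of-guesses loop; it is not the actual Medusa implementation. `base_next_token` is a hypothetical stand-in for the real model's greedy next-token step, and `medusa_head_guesses` is a hypothetical stand-in for the trained extra heads (real heads are learned and imperfect; these are made deliberately noisy):

```python
import itertools

# Toy "base model": a deterministic next-token function over an 11-token vocab.
# Hypothetical stand-in for a full LLM forward pass.
def base_next_token(context):
    return (sum(context) * 7 + 3) % 11  # arbitrary deterministic rule

# Hypothetical Medusa-style head: top-k guesses for the token at `offset`
# positions past the context (offset 2 = the token after the next token).
def medusa_head_guesses(context, offset, k=2):
    ctx = list(context)
    for _ in range(offset):
        ctx.append(base_next_token(ctx))
    truth = ctx[-1]
    return [truth, (truth + 1) % 11][:k]  # correct guess plus a distractor

def medusa_step(context, num_heads=3, k=2):
    """One decode step: 1 guaranteed token plus verified speculative tokens."""
    first = base_next_token(context)  # ordinary next-token prediction
    # Heads guess the tokens at offsets 2, 3, ... past the context.
    head_topk = [medusa_head_guesses(context, off, k)
                 for off in range(2, 2 + num_heads)]
    # The candidate tree is the Cartesian product of the heads' top-k lists.
    # (Real Medusa scores all branches in one masked forward pass; we loop.)
    best = [first]
    for cand in itertools.product(*head_topk):
        path = list(context) + [first]
        accepted = [first]
        for tok in cand:
            if base_next_token(path) == tok:  # verify against the base model
                accepted.append(tok)
                path.append(tok)
            else:
                break  # first mismatch ends this branch
        if len(accepted) > len(best):
            best = accepted
    return best  # at least 1 token per step, so never slower than greedy
```

Because verification only accepts tokens the base model would have produced anyway, the output matches plain greedy decoding; the speedup comes from emitting several of those tokens per forward pass when the heads guess well.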
Links: Blog, Implementation, Models
[Two images from the Medusa blog (expired GitHub attachment links)]
Ref to llama.cpp issue ggerganov/llama.cpp#3137