predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

Home Page: https://loraexchange.ai


Improve warmup checking for max new tokens when using speculative decoding

tgaddair opened this issue

If speculative decoding is in use and the user requests generation up to the model's maximum number of positional embeddings, errors can arise at runtime, surfacing as a CUDA device-side assert. We should do a better job of detecting these errors during warmup, or gracefully handle this edge case per request.
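One way to handle the edge case per request would be to reserve headroom for the speculative tokens when validating the generation budget: a speculative step can append up to `speculative_tokens + 1` tokens at once, so the final decode step may index positions beyond `max_position_embeddings` even when `max_new_tokens` alone fits. The sketch below is illustrative only; the function name and parameters are hypothetical and not LoRAX's actual API.

```python
# Hypothetical per-request guard; names are illustrative, not LoRAX's API.

def clamp_max_new_tokens(
    input_length: int,
    max_new_tokens: int,
    speculative_tokens: int,
    max_position_embeddings: int,
) -> int:
    """Clamp max_new_tokens so that even a fully accepted speculative
    step cannot index a position past max_position_embeddings.

    The last decode step can overshoot the nominal budget by up to
    speculative_tokens positions, so we reserve that headroom here.
    """
    budget = max_position_embeddings - input_length - speculative_tokens
    if budget <= 0:
        raise ValueError(
            f"prompt of length {input_length} leaves no room for "
            f"generation with {speculative_tokens} speculative tokens "
            f"and a context of {max_position_embeddings} positions"
        )
    return min(max_new_tokens, budget)


# Example: a 4096-position model, a 4000-token prompt, and 5 speculative
# tokens leave room for only 91 new tokens, not the requested 512.
print(clamp_max_new_tokens(4000, 512, 5, 4096))  # -> 91
```

Applying the same bound during warmup (with the largest supported prompt length) would catch misconfigurations before serving, rather than failing mid-request with a device-side assert.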