Improve warmup checking for max new tokens when using speculative decoding
tgaddair opened this issue · comments
Travis Addair commented
If speculative decoding is in use and the user requests generation up to the model's maximum positional embeddings, the request can fail at runtime with a CUDA device-side assert. We should do a better job of detecting this during warmup, or gracefully handle the edge case per request.
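One possible per-request mitigation is to clamp (or reject) the generation budget so that the prompt length plus generated tokens plus the speculative lookahead can never exceed the position embedding table. A minimal sketch follows; the function and parameter names are hypothetical, not taken from the actual codebase:

```python
def check_speculative_budget(input_length: int,
                             max_new_tokens: int,
                             speculative_tokens: int,
                             max_position_embeddings: int) -> int:
    """Return a safe max_new_tokens for this request.

    Worst case, each decode step may write `speculative_tokens` draft
    positions beyond the token actually being accepted, so we reserve
    that headroom against the model's positional embedding limit.
    """
    budget = max_position_embeddings - input_length - speculative_tokens
    if budget <= 0:
        # Rejecting up front avoids the CUDA device-side assert later.
        raise ValueError(
            f"prompt of {input_length} tokens leaves no room to generate "
            f"with {speculative_tokens} speculative tokens under a "
            f"{max_position_embeddings}-position limit"
        )
    # Clamp rather than fail when the user simply asked for too much.
    return min(max_new_tokens, budget)
```

Running the same bound during warmup (with the largest supported prompt length) would surface the misconfiguration at startup instead of mid-request.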