predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

Home Page: https://loraexchange.ai


Improve warmup checking for max new tokens when using speculative decoding

tgaddair opened this issue

If speculative decoding is in use and the user requests generation up to the model's maximum number of positional embeddings, errors can arise at runtime, surfacing as a CUDA device-side assert. We should do a better job of detecting these errors during warmup, or gracefully handle this edge case per request.
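One way to handle the edge case per request would be to reserve headroom for the speculative tokens when validating the generation budget: a speculative step can append up to `speculative_tokens + 1` tokens at once, so the final decode step may index positions beyond `max_position_embeddings` even when `max_new_tokens` alone fits. The sketch below is illustrative only; the function name and parameters are hypothetical and not LoRAX's actual API.

```python
# Hypothetical per-request guard; names are illustrative, not LoRAX's API.

def clamp_max_new_tokens(
    input_length: int,
    max_new_tokens: int,
    speculative_tokens: int,
    max_position_embeddings: int,
) -> int:
    """Clamp max_new_tokens so that even a fully accepted speculative
    step cannot index a position past max_position_embeddings.

    The last decode step can overshoot the nominal budget by up to
    speculative_tokens positions, so we reserve that headroom here.
    """
    budget = max_position_embeddings - input_length - speculative_tokens
    if budget <= 0:
        raise ValueError(
            f"prompt of length {input_length} leaves no room for "
            f"generation with {speculative_tokens} speculative tokens "
            f"and a context of {max_position_embeddings} positions"
        )
    return min(max_new_tokens, budget)


# Example: a 4096-position model, a 4000-token prompt, and 5 speculative
# tokens leave room for only 91 new tokens, not the requested 512.
print(clamp_max_new_tokens(4000, 512, 5, 4096))  # -> 91
```

Applying the same bound during warmup (with the largest supported prompt length) would catch misconfigurations before serving, rather than failing mid-request with a device-side assert.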