Error generating text when using the ExLlamav2_HF loader with a grammar file
GregorioBrc opened this issue · comments
Jose Gregorio Briceño commented
Describe the bug
When trying to generate a response with the ExLlamav2_HF loader and the roleplay grammar file, the model produces only a short piece of text and then throws errors in the console at several points in the code.
Is there an existing issue for this?
- I have searched the existing issues
Reproduction
Load a GPTQ or EXL2 model with the ExLlamav2_HF loader.
Load a grammar file (in my case, the roleplay file).
Try to generate a response.
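For reference, this is roughly how the web UI was launched. The `--loader`, `--model`, and `--share` flags are real text-generation-webui options; the model name is the one from this report, and the grammar file itself is selected in the UI (Parameters tab), not on the command line:

```shell
# Launch text-generation-webui with the ExLlamav2_HF loader (Colab-style setup).
# The grammar file (e.g. roleplay.gbnf) is then loaded from the Parameters tab.
python server.py \
  --loader ExLlamav2_HF \
  --model Epiculous_Violet_Twilight-v0.2-exl2_4.0bpw \
  --share
```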
Screenshot
Logs
changed 22 packages, and audited 23 packages in 1s
3 packages are looking for funding
run `npm fund` for details
1 moderate severity vulnerability
To address all issues (including breaking changes), run:
npm audit fix --force
Run `npm audit` for details.
/content/text-generation-webui
02:30:11-152229 INFO Starting Text generation web UI
Running on local URL: http://127.0.0.1:7860
UI finished loading, trying to launch localtunnel (if it gets stuck here localtunnel is having issues)
Running on public URL: https://702c9d7284631c1938.gradio.live
This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
The password/enpoint ip for localtunnel is: 34.168.100.7
your url is: https://easy-cases-marry.loca.lt
02:32:48-563494 INFO Loading "Epiculous_Violet_Twilight-v0.2-exl2_4.0bpw"
## Warning: Flash Attention is installed but unsupported GPUs were detected.
2024-10-30 02:32:52.059218: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-30 02:32:52.091130: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-30 02:32:52.101257: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-30 02:32:52.135907: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-30 02:32:54.046606: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:600: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
warnings.warn(
02:33:50-722732 INFO Loaded "Epiculous_Violet_Twilight-v0.2-exl2_4.0bpw" in 62.16 seconds.
02:33:50-726901 INFO LOADER: "ExLlamav2_HF"
02:33:50-727873 INFO TRUNCATION LENGTH: 16000
02:33:50-728797 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
Warning: unrecognized tokenizer: using default token formatting
../aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
Traceback (most recent call last):
File "/content/text-generation-webui/modules/callbacks.py", line 61, in gentask
ret = self.mfunc(callback=_callback, *args, **self.kwargs)
File "/content/text-generation-webui/modules/text_generation.py", line 398, in generate_with_callback
shared.model.generate(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2215, in generate
result = self._sample(
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 3195, in _sample
while self._has_unfinished_sequences(
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2413, in _has_unfinished_sequences
elif this_peer_finished:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
Exception in thread Thread-3 (gentask):
File "/content/text-generation-webui/modules/text_generation.py", line 407, in generate_reply_HF
if output[-1] in eos_token_ids:
Traceback (most recent call last):
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/content/text-generation-webui/modules/text_generation.py", line 403, in generate_reply_HF
with generate_with_streaming(**generate_params) as generator:
File "/content/text-generation-webui/modules/callbacks.py", line 94, in __exit__
clear_torch_cache()
File "/content/text-generation-webui/modules/callbacks.py", line 105, in clear_torch_cache
torch.cuda.empty_cache()
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py", line 192, in empty_cache
torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Output generated in 4.44 seconds (0.90 tokens/s, 4 tokens, context 359, seed 1331220601)
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/content/text-generation-webui/modules/callbacks.py", line 68, in gentask
clear_torch_cache()
File "/content/text-generation-webui/modules/callbacks.py", line 105, in clear_torch_cache
torch.cuda.empty_cache()
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py", line 192, in empty_cache
torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
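The first assertion in the log (`probability tensor contains either 'inf', 'nan' or element < 0`) suggests the grammar-constrained sampler ended up with an invalid probability distribution. A minimal sketch in plain Python (not the webui's or exllamav2's actual code) of how this can happen: grammar enforcement masks disallowed tokens to `-inf`, and if every token in the vocabulary gets masked, the softmax produces NaN everywhere, which is exactly what the CUDA kernel then asserts on:

```python
import math

def softmax(xs):
    # Standard max-shifted softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Normal case: unmasked logits yield a valid probability distribution.
logits = [2.0, 1.0, 0.5, -1.0, 0.0]
probs = softmax(logits)
assert abs(sum(probs) - 1.0) < 1e-9

# If the grammar bans every token, all logits are masked to -inf.
# Then max(xs) is -inf, so (x - m) is (-inf) - (-inf) = nan,
# and every probability comes out NaN -- an invalid distribution,
# matching the "probability tensor contains 'inf', 'nan'" assert.
masked = [float("-inf")] * len(logits)
bad_probs = softmax(masked)
assert all(math.isnan(p) for p in bad_probs)
```

This is only an illustration of the failure mode; whether the roleplay grammar actually masks all tokens at some step with this loader would need to be confirmed by the maintainers.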
System Info
Google Colab