oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Error generating text when using the ExLlamav2_HF loader with a grammar file

GregorioBrc opened this issue · comments

Describe the bug

When trying to generate a response with the ExLlamav2_HF loader and the roleplay grammar file, the model produces a short text and then throws an error in the console at different points in the code.

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

Load a GPTQ or EXL2 model with the ExLlamav2_HF loader.

Load a grammar file; in my case I tried the roleplay file.

Try to generate a response.
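For context, the `probability tensor contains either inf, nan or element < 0` assertion in the log below is consistent with the grammar filter masking every candidate token to `-inf`, which turns the softmax into all-NaN probabilities. A minimal standard-library sketch of that failure mode (illustrative only, this is not the webui's actual code):

```python
import math

# Simulate a grammar logits filter that (incorrectly) masks every token.
logits = [float("-inf")] * 4

# Standard numerically-stable softmax: subtract the max before exponentiating.
m = max(logits)                           # -inf
exps = [math.exp(x - m) for x in logits]  # (-inf) - (-inf) = nan, exp(nan) = nan
total = sum(exps)                         # nan
probs = [e / total for e in exps]         # every probability is nan

print(probs)
```

Sampling from such a tensor on CUDA would plausibly trip the device-side assert shown in the logs; the `do_sample=False` / `min_p` warning earlier in the log hints that the sampler configuration may also be involved.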

Screenshot

[Three screenshots of the console error]

Logs

changed 22 packages, and audited 23 packages in 1s

3 packages are looking for funding
  run `npm fund` for details

1 moderate severity vulnerability

To address all issues (including breaking changes), run:
  npm audit fix --force

Run `npm audit` for details.
/content/text-generation-webui
02:30:11-152229 INFO     Starting Text generation web UI                                            

Running on local URL:  http://127.0.0.1:7860

\CFUI finished loading, trying to launch localtunnel (if it gets stuck here localtunnel is having issues)

Running on public URL: https://702c9d7284631c1938.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
The password/enpoint ip for localtunnel is: 34.168.100.7
your url is: https://easy-cases-marry.loca.lt
02:32:48-563494 INFO     Loading "Epiculous_Violet_Twilight-v0.2-exl2_4.0bpw"                       
 ## Warning: Flash Attention is installed but unsupported GPUs were detected.
2024-10-30 02:32:52.059218: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-30 02:32:52.091130: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-30 02:32:52.101257: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-30 02:32:52.135907: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-30 02:32:54.046606: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:600: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
  warnings.warn(
02:33:50-722732 INFO     Loaded "Epiculous_Violet_Twilight-v0.2-exl2_4.0bpw" in 62.16 seconds.      
02:33:50-726901 INFO     LOADER: "ExLlamav2_HF"                                                     
02:33:50-727873 INFO     TRUNCATION LENGTH: 16000                                                   
02:33:50-728797 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"              
Warning: unrecognized tokenizer: using default token formatting
../aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
Traceback (most recent call last):
  File "/content/text-generation-webui/modules/callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/content/text-generation-webui/modules/text_generation.py", line 398, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2215, in generate
    result = self._sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 3195, in _sample
    while self._has_unfinished_sequences(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2413, in _has_unfinished_sequences
    elif this_peer_finished:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
Exception in thread Thread-3 (gentask):
  File "/content/text-generation-webui/modules/text_generation.py", line 407, in generate_reply_HF
    if output[-1] in eos_token_ids:
Traceback (most recent call last):
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/text-generation-webui/modules/text_generation.py", line 403, in generate_reply_HF
    with generate_with_streaming(**generate_params) as generator:
  File "/content/text-generation-webui/modules/callbacks.py", line 94, in __exit__
    clear_torch_cache()
  File "/content/text-generation-webui/modules/callbacks.py", line 105, in clear_torch_cache
    torch.cuda.empty_cache()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py", line 192, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Output generated in 4.44 seconds (0.90 tokens/s, 4 tokens, context 359, seed 1331220601)
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/content/text-generation-webui/modules/callbacks.py", line 68, in gentask
    clear_torch_cache()
  File "/content/text-generation-webui/modules/callbacks.py", line 105, in clear_torch_cache
    torch.cuda.empty_cache()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py", line 192, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

System Info

Google Colab