oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Out of range integral type conversion attempted

Stargate256 opened this issue

Describe the bug

When running inference over the OpenAI-compatible API with Perplexica or avante.nvim, the error sometimes appears; once that happens, generation stops working until I restart the program. (It worked fine with Open WebUI.)
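
For context, the fast (Rust-backed) tokenizer in transformers only accepts token ids that fit in an unsigned 32-bit integer, so a single negative or oversized id in the output is enough to raise this exact OverflowError. A minimal sketch of that failure mode, assuming an arbitrary fast tokenizer (the model name here is just an example, it is not tied to this bug):

# Minimal sketch: any Hugging Face fast tokenizer behaves the same way.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

ids = tok.encode("hello world")
print(tok.decode(ids))         # valid ids decode normally

# A single id outside the valid range (e.g. a stray -1) trips the Rust
# backend's u32 conversion:
print(tok.decode(ids + [-1]))  # OverflowError: out of range integral type conversion attempted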

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

  • Set up the program on Debian 12
  • Run Qwen2.5-32B-Instruct-4.65bpw-h6-exl2
  • Run inference over the OpenAI-compatible API (Perplexica, avante.nvim, or something else); a hedged example request follows this list
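
For reference, the requests go to the webui's OpenAI-compatible endpoint (http://0.0.0.0:5000, as the startup logs below show). A sketch of such a call with the official openai Python client; the base URL, API key placeholder, and prompt are assumptions, and the currently loaded model answers regardless of the model field:

from openai import OpenAI

# Assumed local endpoint; the "openai" extension logs http://0.0.0.0:5000 on startup.
client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen2.5-32B-Instruct-4.65bpw-h6-exl2",  # placeholder; the loaded model is used
    messages=[{"role": "user", "content": "Summarize the Debian 12 release notes."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)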

Screenshot

(Three screenshots attached: 2024-11-02-233644_hyprshot, 2024-11-02-233722_hyprshot, 2024-11-02-233759_hyprshot)

Logs

Traceback (most recent call last):
  File "/root/llm/text-generation-webui/modules/text_generation.py", line 410, in generate_reply_HF
    new_content = get_reply_from_output_ids(output, state, starting_from=starting_from)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/llm/text-generation-webui/modules/text_generation.py", line 271, in get_reply_from_output_ids
    reply = decode(output_ids[starting_from:], state['skip_special_tokens'] if state else True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/llm/text-generation-webui/modules/text_generation.py", line 181, in decode
    return shared.tokenizer.decode(output_ids, skip_special_tokens=skip_special_tokens)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/llm/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3999, in decode
    return self._decode(
           ^^^^^^^^^^^^^
  File "/root/llm/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 654, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OverflowError: out of range integral type conversion attempted
Output generated in 2.50 seconds (7.99 tokens/s, 20 tokens, context 1131, seed 1200683755)

System Info

Environment: Proxmox VE VM
CPU: 6 virtual cores of Xeon E5-2697 v3
GPU: 2x Nvidia Tesla P100 16GB (PCIe passthrough)
OS: Debian 12
LLM: Qwen2.5-32B-Instruct-4.65bpw-h6-exl2

I am running into the same issue, also on Debian 12, on an older Intel CPU, while trying to run a Qwen2.5 exl2 model over the OpenAI API (with cline and aider). In my case a few requests work, then this error occurs, after which the responses contain few or no characters. Unloading and reloading the model doesn't seem to help.

I'm running the web UI directly on physical hardware. I tried upgrading all the packages on my system, which brought in a new kernel version, but nothing changed after the upgrade.

Logs

13:27:44-948132 INFO     Starting Text generation web UI                        
13:27:44-952671 WARNING                                                         
                         You are potentially exposing the web UI to the entire  
                         internet without any access password.                  
                         You can create one with the "--gradio-auth" flag like  
                         this:                                                  
                                                                                
                         --gradio-auth username:password                        
                                                                                
                         Make sure to replace username:password with your own.  
13:27:44-954803 INFO     Loading the extension "openai"                         
13:27:45-089753 INFO     OpenAI-compatible API URL:                             
                                                                                
                         http://0.0.0.0:5000                                    
                                                                                

Running on local URL:  http://0.0.0.0:7860

13:27:51-237158 INFO     Loading                                                
                         "bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25"       
/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:600: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
  warnings.warn(
13:27:58-621663 INFO     Loaded "bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25"
                         in 7.38 seconds.                                       
13:27:58-623069 INFO     LOADER: "ExLlamav2_HF"                                 
13:27:58-624496 INFO     TRUNCATION LENGTH: 8000                                
13:27:58-625390 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model     
                         metadata)"                                             
/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:590: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:600: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
  warnings.warn(
Output generated in 4.25 seconds (21.15 tokens/s, 90 tokens, context 896, seed 2117395925)
Output generated in 2.80 seconds (22.11 tokens/s, 62 tokens, context 1011, seed 1939351019)
Traceback (most recent call last):
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 410, in generate_reply_HF
    new_content = get_reply_from_output_ids(output, state, starting_from=starting_from)
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 271, in get_reply_from_output_ids
    reply = decode(output_ids[starting_from:], state['skip_special_tokens'] if state else True)
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 181, in decode
    return shared.tokenizer.decode(output_ids, skip_special_tokens=skip_special_tokens)
  File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 4004, in decode
    return self._decode(
  File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 654, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted
Output generated in 2.30 seconds (21.29 tokens/s, 49 tokens, context 1274, seed 1253585262)
Traceback (most recent call last):
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 410, in generate_reply_HF
    new_content = get_reply_from_output_ids(output, state, starting_from=starting_from)
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 271, in get_reply_from_output_ids
    reply = decode(output_ids[starting_from:], state['skip_special_tokens'] if state else True)
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 181, in decode
    return shared.tokenizer.decode(output_ids, skip_special_tokens=skip_special_tokens)
  File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 4004, in decode
    return self._decode(
  File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 654, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted
Output generated in 0.74 seconds (1.36 tokens/s, 1 tokens, context 1153, seed 8088406)
13:32:10-609612 INFO     Loading                                                
                         "bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25"       
13:32:16-405005 INFO     Loaded "bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25"
                         in 5.79 seconds.                                       
13:32:16-407239 INFO     LOADER: "ExLlamav2_HF"                                 
13:32:16-408041 INFO     TRUNCATION LENGTH: 8000                                
13:32:16-408885 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model     
                         metadata)"                                             
Traceback (most recent call last):
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 410, in generate_reply_HF
    new_content = get_reply_from_output_ids(output, state, starting_from=starting_from)
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 271, in get_reply_from_output_ids
    reply = decode(output_ids[starting_from:], state['skip_special_tokens'] if state else True)
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 181, in decode
    return shared.tokenizer.decode(output_ids, skip_special_tokens=skip_special_tokens)
  File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 4004, in decode
    return self._decode(
  File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 654, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted
Output generated in 1.30 seconds (0.77 tokens/s, 1 tokens, context 1176, seed 304963644)

lscpu

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          36 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   4
  On-line CPU(s) list:    0-3
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz
    CPU family:           6
    Model:                58
    Thread(s) per core:   1
    Core(s) per socket:   4
    Socket(s):            1
    Stepping:             9
    CPU(s) scaling MHz:   42%
    CPU max MHz:          3800.0000
    CPU min MHz:          1600.0000
    BogoMIPS:             6799.95
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mm
                          x fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_go
                          od nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est 
                          tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx f16c rd
                          rand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid
                           fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    128 KiB (4 instances)
  L1i:                    128 KiB (4 instances)
  L2:                     1 MiB (4 instances)
  L3:                     6 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-3
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          KVM: Mitigation: VMX disabled
  L1tf:                   Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
  Mds:                    Mitigation; Clear CPU buffers; SMT disabled
  Meltdown:               Mitigation; PTI
  Mmio stale data:        Unknown: No mitigations
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS
                           Not affected; BHI Not affected
  Srbds:                  Vulnerable: No microcode
  Tsx async abort:        Not affected

free -m

               total        used        free      shared  buff/cache   available
Mem:           23982        1365       12767           4       10200       22617
Swap:           7999           0        7999

nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:01:00.0 Off |                  N/A |
|  0%   35C    P8              11W / 170W |    181MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       777      G   /usr/lib/xorg/Xorg                          167MiB |
|    0   N/A  N/A       964      G   /usr/bin/gnome-shell                          8MiB |
+---------------------------------------------------------------------------------------+

uname -a

Linux bash-3lpc 6.1.0-27-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.115-1 (2024-11-01) x86_64 GNU/Linux

python3 --version

Python 3.11.2

conda --version

conda 23.5.2

cat /etc/debian_version

12.8

git rev-parse HEAD

cc8c7ed2093cbc747e7032420eae14b5b3c30311

Actually, it seems like the ExLlamav2 loader works. Previously I was using the auto-suggested ExLlamav2_HF loader.
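
That would be consistent with the traceback: the _HF wrappers decode replies through the transformers fast tokenizer (shared.tokenizer.decode), whose Rust backend rejects any id outside the unsigned 32-bit range, whereas the native ExLlamav2 loader presumably decodes with exllamav2's own tokenizer. A purely hypothetical guard, an assumption rather than the project's actual code, showing where the _HF path trips:

# Hypothetical guard (not the webui's code): skip ids outside the tokenizer's
# valid range before decoding, so one bad id cannot abort the whole reply.
# len(tokenizer) is the vocabulary size including added tokens.
def safe_decode(tokenizer, output_ids, skip_special_tokens=True):
    valid_ids = [int(i) for i in output_ids if 0 <= int(i) < len(tokenizer)]
    return tokenizer.decode(valid_ids, skip_special_tokens=skip_special_tokens)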

Logs (prompts were sent from Aider)

14:04:59-537587 INFO     Loading "bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25"                                    
14:05:06-734713 INFO     Loaded "bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25" in 7.20 seconds.                    
14:05:06-736142 INFO     LOADER: "ExLlamav2"                                                                         
14:05:06-737239 INFO     TRUNCATION LENGTH: 8000                                                                     
14:05:06-738057 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"                               
Output generated in 6.36 seconds (23.59 tokens/s, 150 tokens, context 1223, seed 186741678)
Output generated in 6.44 seconds (28.58 tokens/s, 184 tokens, context 791, seed 1585153886)
Output generated in 8.19 seconds (28.21 tokens/s, 231 tokens, context 1100, seed 1836764177)
Output generated in 7.53 seconds (29.60 tokens/s, 223 tokens, context 1371, seed 991524308)
Output generated in 10.12 seconds (29.56 tokens/s, 299 tokens, context 991, seed 455187287)
Output generated in 7.84 seconds (14.41 tokens/s, 113 tokens, context 4991, seed 1045094786)
Output generated in 1.10 seconds (16.42 tokens/s, 18 tokens, context 5210, seed 2042193150)
Output generated in 1.02 seconds (17.63 tokens/s, 18 tokens, context 5334, seed 1359728911)
Output generated in 1.10 seconds (19.17 tokens/s, 21 tokens, context 5458, seed 1694625255)
Output generated in 1.03 seconds (17.47 tokens/s, 18 tokens, context 5584, seed 1240670815)
Output generated in 1.10 seconds (19.02 tokens/s, 21 tokens, context 5708, seed 951578707)
Output generated in 1.11 seconds (18.96 tokens/s, 21 tokens, context 5834, seed 498927830)
Output generated in 1.11 seconds (18.88 tokens/s, 21 tokens, context 5960, seed 131397278)
Output generated in 1.12 seconds (18.82 tokens/s, 21 tokens, context 6086, seed 521276101)
Output generated in 1.13 seconds (18.59 tokens/s, 21 tokens, context 6212, seed 995108441)
Output generated in 1.13 seconds (18.54 tokens/s, 21 tokens, context 6338, seed 143805776)
Output generated in 2.15 seconds (22.81 tokens/s, 49 tokens, context 6464, seed 2070214832)
Output generated in 3.90 seconds (28.23 tokens/s, 110 tokens, context 4991, seed 805553205)
Output generated in 1.10 seconds (16.41 tokens/s, 18 tokens, context 5207, seed 1120525451)
Output generated in 1.01 seconds (17.75 tokens/s, 18 tokens, context 5331, seed 693321549)
Output generated in 1.10 seconds (19.16 tokens/s, 21 tokens, context 5455, seed 763349559)
Output generated in 0.86 seconds (16.28 tokens/s, 14 tokens, context 5581, seed 1450090146)
Output generated in 1.54 seconds (12.37 tokens/s, 19 tokens, context 1130, seed 1622652563)
Output generated in 7.42 seconds (30.73 tokens/s, 228 tokens, context 1175, seed 1043527426)
Output generated in 11.83 seconds (30.60 tokens/s, 362 tokens, context 762, seed 1054832108)
Output generated in 8.22 seconds (27.37 tokens/s, 225 tokens, context 1430, seed 600097550)
Output generated in 11.96 seconds (30.26 tokens/s, 362 tokens, context 866, seed 832375840)
Output generated in 8.24 seconds (28.16 tokens/s, 232 tokens, context 1188, seed 1514631067)
Output generated in 11.94 seconds (30.33 tokens/s, 362 tokens, context 830, seed 1119770377)
Output generated in 7.61 seconds (27.85 tokens/s, 212 tokens, context 1206, seed 83295453)
Output generated in 13.99 seconds (30.59 tokens/s, 428 tokens, context 836, seed 1989837235)
Output generated in 7.92 seconds (27.92 tokens/s, 221 tokens, context 1252, seed 1324992220)
Output generated in 15.83 seconds (30.70 tokens/s, 486 tokens, context 874, seed 151775036)
Output generated in 11.13 seconds (29.03 tokens/s, 323 tokens, context 1232, seed 746128985)

Just wanted to confirm that I have the same issue.

Model: bartowski/Qwen2.5-Coder-32B-Instruct-exl2 @ 4.25
Loader: ExLlamav2_HF