llama.cpp sampling doesn't work in 1.15
GodEmperor785 opened this issue
Describe the bug
llama.cpp models always give exactly the same output (compared in WinMerge to be sure), as if they ignore all sampling options and the seed. Sometimes the first output after loading a model is a bit different, but every regenerate gives exactly the same output. I also tried with temperature=5. I can see the seed changing in the console. Even reloading the model or restarting the whole webui doesn't help.
This seems to happen only with the llama.cpp loader; I tried some exl2 models and they worked fine - the outputs were different.
This doesn't seem to be model specific, as I tried multiple GGUF models (which worked as expected in the past), like Mistral Nemo, Mistral Small, and Qwen 2.5 32B.
The same GGUFs worked before updating the webui to 1.15 (with update_wizard_windows.bat, as I usually do), so something in that update probably broke it?
This doesn't seem to be a purely UI issue either, as I tried with SillyTavern over the API too and the effect was the same.
I also tried installing a fresh copy of the webui (git clone and start_windows.bat), but the issue still happens on that fresh install.
There was a similar issue in the past, #5451 - but in that case changing top_k helped, while in my case it didn't. Also, the llama-cpp-python versions mentioned in that issue are very old (as the issue itself is old).
I don't know whether the source of this problem is in the webui or in llama-cpp-python.
Is there an existing issue for this?
- I have searched the existing issues
Reproduction
- Load any GGUF model with llama.cpp loader
- Generate any response and note it down
- Regenerate multiple times with high temperature
The regenerated outputs are always the same
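For anyone who wants to double-check this outside the webui, a minimal sketch (not from the original report, and the model path is a placeholder) of the same comparison done directly against llama-cpp-python could look like this:

# Generate twice with a high temperature and compare; with working sampling
# (and the default random seed) the two outputs should normally differ.
from llama_cpp import Llama

llm = Llama(model_path="models/your-model.gguf", n_ctx=2048, verbose=False)  # placeholder path

def generate():
    out = llm.create_completion(
        "Write one sentence about the sea.",
        max_tokens=64,
        temperature=5.0,  # deliberately high, so identical outputs indicate a sampling problem
        top_p=1.0,
    )
    return out["choices"][0]["text"]

a, b = generate(), generate()
print("identical" if a == b else "different")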
Screenshot
No response
Logs
Not sure what logs might be needed here
System Info
Windows 10
RTX 3090 - GPU driver 565.90
webui 1.15 - commit d1af7a41ade7bd3c3a463bfa640725edb818ebaf (newest on branch main)
Small update:
I installed another copy of the webui, but on the last commit from version 1.14 (git checkout 073694b).
Installation wasn't straightforward due to some very long pydantic errors when launching the webui, but after some googling I found this issue: jhj0517/Whisper-WebUI#258
They said to update gradio or use an older fastapi. I checked the webui commits a bit and used this one:
ac30b00#diff-4d7c51b1efe9043e44439a949dfd92e5827321b34082903477fd04876edb7552R8
So I manually ran the following commands (in cmd_windows.bat):
python -m pip install fastapi==0.112.4
python -m pip install pydantic==2.8.2
After that, webui 1.14 loads without errors.
So I was finally able to test the GGUF problem on 1.14 - result: it works as expected; the same GGUF model as before generates different outputs without any issues.
So it seems that something between version 1.14 and 1.15 broke it.
I don't know how to test specific llama-cpp-python versions, because when I install another version from a .whl file from https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels (which seems to be where the webui gets its .whl files) I get an error:
Traceback (most recent call last):
File "H:\text_AI\testoldversion\text-generation-webui\modules\ui_model_menu.py", line 231, in load_model_wrapper
shared.model, shared.tokenizer = load_model(selected_model, loader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "H:\text_AI\testoldversion\text-generation-webui\modules\models.py", line 93, in load_model
output = load_func_map[loader](model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "H:\text_AI\testoldversion\text-generation-webui\modules\models.py", line 278, in llamacpp_loader
model, tokenizer = LlamaCppModel.from_pretrained(model_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "H:\text_AI\testoldversion\text-generation-webui\modules\llamacpp_model.py", line 85, in from_pretrained
result.model = Llama(**params)
^^^^^^^^^^^^^^^
File "H:\text_AI\testoldversion\text-generation-webui\installer_files\env\Lib\site-packages\llama_cpp_cuda\llama.py", line 369, in __init__
internals.LlamaModel(
File "H:\text_AI\testoldversion\text-generation-webui\installer_files\env\Lib\site-packages\llama_cpp_cuda\_internals.py", line 51, in __init__
model = llama_cpp.llama_load_model_from_file(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ctypes.ArgumentError: argument 2: TypeError: expected llama_model_params instance instead of llama_model_params
And I don't know how to fix that.
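Not from the original comment, but possibly useful for anyone hitting the same ctypes error: an "expected X instance instead of X" mismatch often points to two different copies or builds of the bindings being mixed in the same environment. A quick way to see which llama-cpp-python variants are actually installed in the webui's env (run inside cmd_windows.bat) is a sketch like this:

# List every installed distribution whose name contains "llama" along with its
# version, to spot mixed or leftover llama-cpp-python builds.
from importlib.metadata import distributions

for dist in distributions():
    name = dist.metadata["Name"] or ""
    if "llama" in name.lower():
        print(name, dist.version)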
So I hope someone knows what changed between 1.14 and 1.15 that could cause this problem.
It seems to be a llama-cpp-python bug. If you use llamacpp_HF you shouldn't encounter this problem.
Thanks for the suggestion. I just checked that llamacpp_HF works, but it requires additional manual work for each GGUF, as the wiki says:
Place your .gguf in a subfolder of models/ along with these 3 files: tokenizer.model, tokenizer_config.json, and special_tokens_map.json.
Simply loading a single GGUF file with llamacpp_HF fails.
Also, it might be good to update the instructions for llamacpp_HF; I found out that you can use tokenizer.json instead of tokenizer.model if the latter is missing (and it is missing for a lot of models, like Mistral Nemo, Qwen 2.5 or even Llama 3.1).
So the difficulty is finding the necessary additional files, which are not in the GGUF repos on Hugging Face. You have to manually go to the original model repo and get those 3 files (it is not that much of a problem, but it is additional manual work to get a model working).
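Side note, not from the original comment: fetching those files could also be scripted with huggingface_hub; the repo id and target folder in this sketch are placeholders.

# Hypothetical helper: download the tokenizer files for llamacpp_HF into the
# folder that holds the .gguf. The repo id and folder are placeholders.
from huggingface_hub import hf_hub_download

target = "models/my-model-GGUF"
for name in ("tokenizer_config.json", "special_tokens_map.json", "tokenizer.json"):
    # some original repos ship tokenizer.model instead of tokenizer.json
    hf_hub_download(repo_id="original-org/original-model", filename=name, local_dir=target)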
EDIT: I just found the "llamacpp_HF creator" menu in the webui Models tab; it simplifies the process, but the user still needs a link to the original model.
I will use llamacpp_HF with those additional files for now, but I hope the normal llama.cpp loader can work again in the future.
If this problem is due to a llama-cpp-python bug, maybe its version should be downgraded?
I'm running into that same error too. I even tried going back to version 1.14, but no luck. Makes me wonder if I somehow ended up with 1.15 (again) without realizing it during the install
When I tried using llamacpp_HF, the model felt completely different. It lost a lot of quality and coherence, and I'm not sure why
I saw there was an issue with the seed in the changes they made to llama.cpp, but they've already updated it. Hopefully that solves it
@DelteRT - You can check what commit your webui is on with git to make sure you have the right commit for 1.14; the command is: git rev-parse HEAD
The commit for 1.14 should be as mentioned in one of my previous comments. You can also compare the list of installed packages (python -m pip list in cmd_windows.bat), like llama-cpp-python - 1.14 should have 2.89, 1.15 has 3.1.
Also, if you didn't have any problems in the UI when on 1.14 (and some very long errors in the console), it is probably not actually 1.14, as I needed to make some modifications to libraries to get it running (as mentioned earlier).
Here #6431 (comment) I described how I did it with 1.14 (including the exact commit and errors).
Also, do you have a link to this "issue with the seed in the changes they made to llama.cpp"?
I misunderstood about the seed. It was just a change in the documentation
There is definitely something in the latest update that messes up the way it receives the parameters when using the API, but I couldn't figure out what it is
P.S.: I followed your instructions and finally managed to revert to 1.14 (I had a strange error, but I solved it by downgrading transformers). Now everything is working correctly, thanks!
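An aside, not from the thread itself: one way to narrow down whether parameters sent over the API are honored is to post the identical request twice with a high temperature against the OpenAI-compatible endpoint and compare the results. The URL and port in this sketch assume the webui's defaults and may need adjusting to your own flags.

# Send the same completion request twice and compare the outputs.
# URL/port assume the default OpenAI-compatible API on localhost.
import requests

url = "http://127.0.0.1:5000/v1/completions"
payload = {
    "prompt": "Write one sentence about the sea.",
    "max_tokens": 64,
    "temperature": 5.0,
    "top_p": 1.0,
}

a = requests.post(url, json=payload).json()["choices"][0]["text"]
b = requests.post(url, json=payload).json()["choices"][0]["text"]
print("identical" if a == b else "different")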
Same issue, all regens are the same when loading via llama.cpp. I cannot properly install 1.14 - I can't even get it to start properly when loading the openai extension. Despite the workaround and instructions above, I am unable to fix this myself. Great -.-
So either 1.15 receives a FIX, as this is clearly a bug, or no more chat for me =)
Edit: another go at 1.14 worked; I stupidly forgot to use cmd_windows.bat to properly activate the env. Now it works on 1.14 again, regen is back. Thx @GodEmperor785
Same issue here, Windows, CPU/GGUF model. I've tried downloading the 1.14 files but it seems to just download the newest version when I run start_windows.bat the first time.
Edit: Sorry for the slightly off-topic bit; I'm unfamiliar with git, so I wasn't sure if this behavior was an issue or not. I also wanted to add another "yeah, I'm having this issue too" post with the OS I was using, to help confirm this is happening to people.
That's a bit off-topic ^^, but I see what you're saying. => Don't instantly run start_windows.bat - instead, download either the 1.14 release zip or do a git clone OOBA-URL in Git Bash (or PowerShell), then cd text-generation-webui to switch into the just-pulled directory. To make sure we ONLY use this particular copy, switch to the older release build via git checkout 073694b - and then run start_windows.bat (normally, but we need to change two packages, see below).
Naturally, when you fire up the updater script and use the A) option, it'll look for updates and pull the latest data - and we don't want this script to touch our "requirements.txt".
Now, an important additional step is required, since there are some minor hurdles. Before running any scripts after the checkout, use cmd_windows.bat to open the virtual environment, and inside that new command prompt follow GodEmperor785's recommendation and copy-paste these two specific package installs:
python -m pip install fastapi==0.112.4
python -m pip install pydantic==2.8.2
After this, finally, fire up start_windows.bat as you would normally. Ooba should load in a vanilla config, so all we need to do now is supply the desired CMD_flags.txt (I use --api --listen --listen-port 8860 --extensions openai, for example) and Bob's your uncle.
I hope this helps, Dave =)
I did some more testing and it seems that changing "mirostat_mode" in Parameters to something other than the default value (the default is 0, so set it to 1 or 2) makes the output different with each regenerate.
I don't know what the values 0, 1 and 2 in "mirostat_mode" do exactly.
So it seems that mirostat_mode=0 breaks something in 1.15.
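Added context, not from the original comment: in llama.cpp, mirostat_mode 0 disables Mirostat so the regular samplers (temperature, top_p, top_k, ...) apply, while 1 and 2 select Mirostat v1 and v2, which steer towards a target perplexity via mirostat_tau/mirostat_eta instead. A rough sketch of how those parameters look when calling llama-cpp-python directly (the model path is a placeholder):

# Rough sketch of the mirostat parameters via llama-cpp-python directly.
from llama_cpp import Llama

llm = Llama(model_path="models/your-model.gguf", verbose=False)  # placeholder path

# mirostat_mode=0: Mirostat off, regular sampling (temperature/top_p/top_k) applies
regular = llm.create_completion("Hello", max_tokens=32, temperature=1.0, mirostat_mode=0)

# mirostat_mode=2: Mirostat v2, steered by mirostat_tau / mirostat_eta
mirostat = llm.create_completion("Hello", max_tokens=32, mirostat_mode=2,
                                 mirostat_tau=5.0, mirostat_eta=0.1)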
Indeed. I have the same issue and experience. The only thing that affects the output is changing the mirostat_mode
In my case, I fixed the problem by setting top_p to 0.99 instead of 0 or 1; it seems like top_p is broken.