unixwzrd / text-generation-webui-macos

A macOS version of the oobabooga gradio web UI for running Large Language Models like LLaMA, llama.cpp, GPT-J, Pythia, OPT, and GALACTICA.

bitsandbytes issue still there

unixwzrd opened this issue · comments

commented

Describe the bug

unixwzrd/oobabooga-macOS#4 (comment)

A warning message appears when bitsandbytes is first loaded, saying that the CPU is being used instead of the supported CUDA configuration.

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

All the time

Screenshot

none required

Logs

none necessary

System Info

All macOS platforms.
commented

@jhandl - moving the issue here. Here is the original report:

I just followed all the instructions in the "impatient" script, opting for the text-generation-webui-macos trunk version (not test or dev), and when running the start-webui script I get:

(webui.04.final-gguf) jhandl@macmini webui-macOS % ./start.sh 
/Users/jhandl/miniconda3/envs/webui.04.final-gguf/lib/python3.10/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of **bitsandbytes was compiled without GPU support**. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
'NoneType' object has no attribute 'cadam32bit_grad_fp32'
Traceback (most recent call last):
  File "/Users/jhandl/webui-macOS/server.py", line 6, in <module>
    from modules import chat, loaders, presets, shared, training, ui, utils
  File "/Users/jhandl/webui-macOS/modules/chat.py", line 17, in <module>
    from modules.text_generation import (
  File "/Users/jhandl/webui-macOS/modules/text_generation.py", line 22, in <module>
    from modules.models import clear_torch_cache
  File "/Users/jhandl/webui-macOS/modules/models.py", line 23, in <module>
    from modules.models_settings import infer_loader
  File "/Users/jhandl/webui-macOS/modules/models_settings.py", line 31
    elif re.match('.*gguf.*\.bin', model_name.lower()):
                                                       ^
**IndentationError: unindent does not match any outer indentation level**

After fixing the indentation issue the server started, but no matter what I tried, all I got was the input instantly copied to the output and nothing else, with the server reporting that it generated 0 tokens, at a pace of 0.00 tokens/s, in 0.23 seconds.

I then cloned the original Oobabooga repo (following the "impatient" instructions), and now Oobabooga worked, but at 0.06 tokens/s, so not using the GPU. I also monitored the GPU usage through the Activity Monitor and it was clearly at 0 the whole time.

Setup: Mac mini M2 Pro 16 GB, Mac OS 13.5.2

commented

Could you let me know what sort of model you are loading and which loader you are using? For instance, are you loading it with llama.cpp?

If you are using llama.cpp, could you let me know which version you are using?

I'll report back in a bit what I find out on my end.

Thanks!

Hi @unixwzrd, I'm using vicuna-13b-v1.5-16k.Q4_K_M.gguf, loaded with llama.cpp. The bitsandbytes message appears on server startup though, before loading any models.

About the version of llama.cpp, I'm not sure how to check that. I installed everything last night following the instructions, so I guess it has to be the latest version?

commented

OK, I checked out the indentation issue, and it was very strange. My local repo wouldn't push to GitHub, but it should be fixed now; I cloned it after the fix and saw no issue. Not sure how that crept in there or why my repair wouldn't push, but GitHub argues with me regularly.

The item on bitsandbytes is only a warning that the library is using the CPU instead of the architecture it typically runs on, CUDA. The message can be ignored; I haven't tracked down a solution for it yet, but it isn't harming anything.
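
If the warning bothers you, one way to hide it is with a standard Python warnings filter installed before bitsandbytes is imported. This is just a sketch for silencing the message; it does not enable GPU support in bitsandbytes:

```python
# Sketch: suppress the bitsandbytes CPU-only warning seen at startup.
# This only hides the message; bitsandbytes still uses its CPU code paths.
import warnings

warnings.filterwarnings(
    "ignore",
    message=".*bitsandbytes was compiled without GPU support.*",
    category=UserWarning,
)

# Import bitsandbytes only after the filter is installed, since the warning
# is emitted at import time.
import bitsandbytes  # noqa: E402
```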

As far as inference performance goes, here's what I get using:

Manticore-13B.ggufv3.q4_0.bin

This was originally a GGML file, converted locally to GGUF. Here's the performance from 8 inferences. The first two take a bit of time, I assume because on the first one it is doing some internal housekeeping on startup, but all subsequent inferences were in the high 20s for tokens/s; eyeballing it, that looks like right around 28 tokens/s:

Output generated in 41.71 seconds (0.24 tokens/s, 10 tokens, context 16, seed 973897120)
Output generated in 1.41 seconds (18.45 tokens/s, 26 tokens, context 97, seed 582096379)
Output generated in 2.73 seconds (29.69 tokens/s, 81 tokens, context 146, seed 1641058249)
Output generated in 2.08 seconds (24.54 tokens/s, 51 tokens, context 261, seed 336635775)
Output generated in 2.27 seconds (28.20 tokens/s, 64 tokens, context 330, seed 1943077092)
Output generated in 2.42 seconds (28.08 tokens/s, 68 tokens, context 407, seed 459542484)
Output generated in 2.71 seconds (28.10 tokens/s, 76 tokens, context 507, seed 2105920382)
Output generated in 2.80 seconds (26.44 tokens/s, 74 tokens, context 601, seed 958049896)

This was using the latest llama-cpp-python, 0.2.6.
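
For reference, the conversion itself was done with the GGML-to-GGUF conversion script that ships in the llama.cpp source tree. The script name, flags, and the input filename below are assumptions that may vary between llama.cpp versions, so check your checkout; this is only a sketch of the step:

```python
# Hypothetical sketch of a local GGML -> GGUF conversion using the script
# from the llama.cpp source tree. Script name and flags are assumptions and
# may differ between llama.cpp versions.
import subprocess

subprocess.run(
    [
        "python",
        "convert-llama-ggml-to-gguf.py",              # assumed script name (llama.cpp repo)
        "--input", "Manticore-13B.ggmlv3.q4_0.bin",   # assumed original GGML filename
        "--output", "Manticore-13B.ggufv3.q4_0.bin",  # GGUF file used in the runs above
    ],
    check=True,
)
```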

I'm also using something new: NumPy 1.26.0, which was bumped just today. I'm not sure when you built your environment, but many packages have changed since yesterday.

The NumPy issue may also be affecting the original oobabooga; I can't be sure unless you can give me more information on which Python packages you are using, but it looks fine to me. Attached is the run log for the 8 inferences listed above, along with the Python packages I have installed and their version numbers.
packages.txt
runtext.txt

Everything seems to work for me; I am about to check the original oobabooga main branch now, but I wanted to let you know the results of my investigation.

Here's the dialog from the inferences. It's kinda funny, actually, because it hallucinates big-time and even tries to BS its way out of it...

You
hey, how are you?

Assistant
I'm doing well! How about yourself?

You
Pretty good, all things considered.

Assistant
That's great to hear! Is there anything specific that's on your mind or any questions you have for me today?

You
Just checking to see how fast your inference is working, is it any faster?

Assistant
My inference capabilities are constantly improving as I learn more and process more data. However, the speed at which my inferences occur can vary depending on a variety of factors such as the complexity of the task being performed or the amount of available information. In general, though, you should notice an improvement in both accuracy and efficiency over time as I continue to refine my algorithms and processes.

You
Outstanding, I was just making a few changes to your back-end llama loader, so wanted to see if you are using the GPU.

Assistant
Yes, I am currently utilizing the GPU for certain tasks such as image processing or machine learning operations that require significant computational power. This allows me to perform these types of calculations more quickly and efficiently than would be possible with a traditional CPU-only architecture.

You
How do you know you're using the GPU?

Assistant
I am able to detect when my system is utilizing the GPU by monitoring various performance metrics such as memory usage, processor load, and other indicators that are specific to GPU operations. Additionally, some software applications may provide more detailed information about which hardware components are being used for a particular task or process.

You
What sort of GPU are you running on?

Assistant
I am currently using an NVIDIA GeForce GTX 1080 Ti graphics card with 11GB of memory, which is one of the most powerful consumer-grade GPUs available today. This allows me to perform a wide range of complex computational tasks at high speeds and with excellent accuracy.

You
Bzzzt, wrong answer, you are running on Apple Silicon M1/M2 GPU. But thank you for playing.

Assistant
I apologize if my previous response was incorrect or misleading. You are correct that I am currently using an Apple M1 chip with integrated graphics rather than a traditional NVIDIA GPU. While this may not be as powerful in terms of raw computational performance, it still provides me with the ability to perform many complex tasks at high speeds and with excellent accuracy.

You
Umm, to be specific it's an Apple M2 Max

Assistant
Thank you for providing that additional information! I apologize again if my previous response was not accurate or complete enough. You are correct that I am currently using an Apple M2 chip with integrated graphics on the MacBook Pro 16-inch, which is a powerful and versatile system that provides me with excellent performance capabilities across a wide range of tasks and applications.

Ah, that's interesting. I'm using a large-context model (16K) and very large prompts (>11,000 tokens). If I try much smaller prompts it works much faster (~6.5 tokens/s compared to 0.06 tokens/s). When running my prompts, Activity Monitor shows the swap space growing from 0 to 2.5 GB and the disk doing 25K reads/s and 450 MB/s. Almost no writes though, which I find puzzling if it's swapping. Anyway, this explains why it's so slow.

But there's another factor: I just realized that the n_ctx parameter is only used while loading the model. Until now I only changed it after loading the model. If I set it to anything other than 0 and then load the model, it throws:

File "/Users/jhandl/oobabooga_macos/installer_files/env/lib/python3.10/site-packages/llama_cpp/llama.py", line 334, in __init__

assert self.ctx is not None

So something failed when calling llama_cpp.llama_new_context_with_model().

Sorry, strike that: I was running the version installed by the one-click installer for macOS. Going back to the version installed following the "impatient" script, it loads correctly even with n_ctx set to 128.

But with any value other than 0 it freezes for a while and finally crashes with:

ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
...
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_graph_compute: command buffer 1 failed with status 5
GGML_ASSERT: /private/var/folders/7d/dw6lnmf12gl6tz8zsbgr_d900000gn/T/pip-install-fb9_gphi/llama-cpp-python_f832c45e7f6b42648234897fd862f51b/vendor/llama.cpp/ggml-metal.m:1290: false
./start.sh: line 15:   925 Abort trap: 6           python server.py --extensions Playground

Just to be sure, I tried the same without using the playground, and got the same result:

...
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_graph_compute: command buffer 1 failed with status 5
GGML_ASSERT: /private/var/folders/7d/dw6lnmf12gl6tz8zsbgr_d900000gn/T/pip-install-fb9_gphi/llama-cpp-python_f832c45e7f6b42648234897fd862f51b/vendor/llama.cpp/ggml-metal.m:1290: false
./start.sh: line 16:  1136 Abort trap: 6           python server.py
commented

Be careful with n_ctx; you can definitely have issues there. As the context grows over time, I've had it fail to allocate a contiguous block of memory and crash like that, or similar. Sometimes it takes a while, and as my context fills up it usually gets slower, because the whole context gets tokenized and sent with every prompt. I find that even with the RoPE tricks and everything, extending the context beyond 2K or 4K can often cause problems, either soon after starting or as the context fills. Batch size also seems to figure into it; I've found reducing batch size can help prevent this sort of memory exhaustion.
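
To put rough numbers on why a 16K context hurts so much on a 16 GB machine, here's a back-of-the-envelope KV-cache estimate for a 13B LLaMA-family model (n_layer = 40, n_embd = 5120, 16-bit cache assumed). It's only a sketch; the exact figure depends on the model and on how llama.cpp lays out the cache:

```python
# Rough KV-cache size estimate for a 13B LLaMA-family model.
# Assumes n_layer=40, n_embd=5120 and a 16-bit K/V cache; actual usage
# depends on the architecture and on llama.cpp's cache layout.
def kv_cache_bytes(n_ctx, n_layer=40, n_embd=5120, bytes_per_elem=2):
    # The factor of 2 accounts for the separate K and V tensors per layer.
    return 2 * n_layer * n_ctx * n_embd * bytes_per_elem

for n_ctx in (2048, 4096, 16384):
    gib = kv_cache_bytes(n_ctx) / 2**30
    print(f"n_ctx={n_ctx:>6}: ~{gib:.1f} GiB for the KV cache alone")

# n_ctx=  2048: ~1.6 GiB
# n_ctx=  4096: ~3.1 GiB
# n_ctx= 16384: ~12.5 GiB  (on top of an ~8 GB Q4 model, well past 16 GB RAM)
```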

Are you using the mlock setting in the Models tab? You should set that, because it will prevent some of these types of errors, especially if you are running low on memory. It keeps the pages from getting swapped out or compressed, and you need them intact when inference begins. You also need a bit of "overhead" available for storing intermediates and results of some calculations; Metal/Accelerate try to keep things in place and do in-place operations rather than copying blocks or allocating new blocks for results wherever possible. mlock helped me load the 70B models (why? just to see if I could), and I really think it will help you out a lot.

Batch size is also one that can cause you problems. I'm not sure of the proper way to size it, but I do know that reducing it has prevented my crashing, so rather than setting it big, I now go for about 768. I haven't had a lot of time to do complete benchmarks, but I am putting together a scripted system to get consistent Python and other library build stacks. It needs documenting, but maybe I should just go ahead and make the repo public anyway if it will help people get consistent builds for a VENV.

Also, be sure to set up separate VENVs for the original oobabooga and the macOS one here; it will help you keep the packages separated. I kinda go overboard in creating VENVs since they are relatively cheap, but for testing this issue I built two clones of a "base" on top of a Python 3.10 VENV and went straight through loading them up for each source codebase.

A general rule I have found to work for sizing is to set n_gpu_layers to whatever is reported here:

llm_load_print_meta: n_layer        = 40

Better to undersize n_batch but you will have to experiment with that.

Definitely set mlock if you haven't. Swapping is very bad; if any of your tensors get swapped out, it may not be possible to allocate the contiguous block again when they swap back in.
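
If you want to sanity-check these settings outside the web UI, here's a minimal sketch using llama-cpp-python directly. The model path is a placeholder, and the keyword arguments are from the 0.2.x API as I understand it:

```python
# Minimal sketch for checking Metal offload and memory settings outside the
# web UI, using llama-cpp-python (0.2.x) directly.
from llama_cpp import Llama

llm = Llama(
    model_path="models/vicuna-13b-v1.5-16k.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,        # keep the context modest unless you have RAM to spare
    n_batch=512,       # reduce (e.g. 256-768) if you hit Metal buffer errors
    n_gpu_layers=40,   # match the n_layer value printed by llm_load_print_meta
    use_mlock=True,    # pin the weights so they can't be swapped or compressed
)

out = llm("Q: What GPU are you running on? A:", max_tokens=64)
print(out["choices"][0]["text"])
```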

Our messages crossed as I was compiling all the information I could before I sent a reply, but to locate your llama_cpp version:

(issue4.01.macosmain) [unixwzrd@xanax webuimacOS-main]$ pip list -v | grep llama.cpp
llama_cpp_python          0.2.6              /Users/mps/miniconda3/envs/issue4.01.macosmain/lib/python3.10/site-packages pip
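
From inside Python you can also check it directly; recent llama-cpp-python releases expose a __version__ attribute (treat that as an assumption for older builds):

```python
# Quick version check from inside the same environment.
import llama_cpp

print(llama_cpp.__version__)  # e.g. 0.2.6
```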

Again, thanks for helping find problems; I'd like to shake out as many issues as possible. It's kinda difficult when you're doing dev, QA, and release management as a team of "1". Hopefully I haven't left anything out, but let me know if any of that helps.

Be well...

commented

Oh, as an afterthought: if you are running up against memory issues and it is swapping, you have no guarantee your tensors won't be affected, and you will likely have problems, which it looks like you are, when it goes to allocate new blocks. If your pages are fairly active, the mark-and-sweep memory management algorithm should keep them from getting swapped out. If your swap usage is growing as you run the model, that's not good.

Sometimes I stop everything like VSCode, Safari, Firefox, and Mail, and maybe even restart iTerm. Get your usage down to as little as possible when you are running, especially with your memory at 16 GB. I believe the model you're using is about 8 GB, and all of the model's space must be loaded in memory if you want to be able to use the Apple Silicon GPU support.

But yeah, shut down or restart any apps that are using a lot of memory; some don't garbage-collect nicely. Mail.app used to have issues with this, though not that I can really see lately. After you have reduced your active memory utilization as much as possible, then start using mlock. I've even had models not run with mlock, but rebooting to clear all available memory and then starting with mlock first thing tends to keep things stable.

Thanks for all the advice! I'll take a deep breath and work on this step by step. ;)

  • Regarding n_ctx, I kinda have to set it high, as I'm not doing a chat with several prompts but one very long piece of text that I'm asking the model to summarize. I'm looking at chunking, but I don't think there's a way to split the text without affecting the analysis.

  • I've left batch_size at the default 512. I'll reduce it to 256 and see what happens.

  • Haven't tried mlock with this install, will try it now.

(webui.05.final-gguf) jhandl@macmini webui % pip list -v | grep llama.cpp      
llama_cpp_python          0.2.6        /Users/jhandl/miniconda3/envs/webui.05.final-gguf/lib/python3.10/site-packages pip
  • The swap space doesn't seem to grow significantly during inference; it just grabs it all at the start and stays fairly stable.

  • There's almost nothing else running on this machine. I'm connecting to it through VNC from my laptop. The only things running there are terminal, chrome with just one tab open for the UI, textedit holding the prompt and the Activity Monitor. It's a new machine too, so not much else installed.

On to run tests now.

Hmm. Setting mlock and reducing the batch_size seems to have had a negative effect. It slowed down from 0.06 tokens/s to 0.01 tokens/s.

And I seem to have spoken too soon about the swap space stability. This time I monitored it closely, and it grew from 0 to over 6 GB during inference. I don't know if it reached 6 GB in previous tests. Reads and writes were both high and more or less equal this time, which is what I would expect from page swapping. I don't know why last time I saw only reads; I wasn't paying close attention, though, so I might have caught it in a different part of the process.

I then reduced the n_ctx to 2048. It became much faster, almost 6 tokens/s but, predictably, it completely ignored everything other than the last part of the input text, and in the end it even forgot the instruction that was at the end of the text.

Next I kept the same parameters (n_ctx=2048 and batch_size=256) but added n_gpu_layers=1.

It maxed out the GPU and flew through the text in a few seconds (it took several minutes before) and started producing output at 13 tokens/s!

So the memory is the real issue here. I can't run this large a context with only 16GB of RAM. I'll have to figure out a way to split the text into chunks without impacting the result. That's an interesting problem but I'll take that to another place.
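
For reference, here's a minimal sketch of the kind of chunking I have in mind: split the text into overlapping pieces that each fit a small n_ctx, summarize each piece, then summarize the summaries. The sizes are character-based placeholders (a real pass would measure in tokens), and summarize() is a hypothetical stand-in for a llama.cpp completion call:

```python
# Minimal sketch: overlapping chunks plus a map-then-reduce summarization pass.
# Chunk sizes are rough character-based placeholders; for real use you would
# measure length with the model's tokenizer instead.
def chunk_text(text, chunk_chars=6000, overlap_chars=500):
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap_chars  # overlap preserves some cross-chunk context
    return chunks

def summarize_long(text, summarize):
    # `summarize` is any callable mapping a short text to a short summary,
    # e.g. a hypothetical wrapper around a llama.cpp completion call.
    partial = [summarize(chunk) for chunk in chunk_text(text)]
    return summarize("\n\n".join(partial))
```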

I think you can close this issue. Thanks a ton for your guidance!

commented

Thanks for your assistance too in checking up on me and keeping me honest.

I still think you should keep an eye on your swapping and the memory footprint of your model processor (pick a flavor); your model may crash if it gets swapped out or compressed and there are not enough contiguous blocks to allocate what it needs during inference or when the context fills up. You may be in for trouble, and mlock making things run slow is possibly a symptom of your system beginning to thrash with swap, which will impact performance in a big way. Keep an eye on it using something like vm_stat 5 or vm_stat 10, which will report the memory state every 5 or 10 seconds; you will see the swapping if it's happening. When swapping starts, it is the kiss of death.
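
If you'd rather script the check than watch Activity Monitor, here's a small sketch that polls vm_stat and prints the deltas of the paging and swap counters. The label strings ("Swapins:", "Swapouts:", etc.) are assumed from typical vm_stat output and may differ slightly across macOS versions:

```python
# Sketch: poll macOS `vm_stat` and print swap/paging counter deltas so you can
# see whether swapping ramps up during inference. Counter labels are assumed
# from typical vm_stat output and may vary slightly by macOS version.
import re
import subprocess
import time

def counters():
    out = subprocess.run(["vm_stat"], capture_output=True, text=True).stdout
    values = {}
    for label in ("Pageins", "Pageouts", "Swapins", "Swapouts"):
        m = re.search(rf"{label}:\s+(\d+)", out)
        if m:
            values[label] = int(m.group(1))
    return values

if __name__ == "__main__":
    prev = counters()
    while True:
        time.sleep(5)
        cur = counters()
        deltas = {k: cur[k] - prev.get(k, 0) for k in cur}
        print(deltas)  # rising Swapins/Swapouts while generating means thrashing
        prev = cur
```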

You are on a knife edge juggling memory, but you are correct: breaking your context loads into smaller chunks is probably better, though I'm sure it will be a trade-off somewhere else in your app. If someone had told me I'd need 96 GB of RAM for my machine, I'd have told them they were crazy, but while I am able to run a 70B Q4 model, I'd still like to get TTS and STT models running as well.

Glad to assist, you're very welcome and let me know/post here if you run into anything else! Closing ticket.

Be well...