ollama / ollama

Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.

Home Page: https://ollama.com

Not compiled with GPU offload support

oldgithubman opened this issue · comments

What is the issue?

Trying to use ollama as normal with the GPU. It worked before the update; now it only uses the CPU.
$ journalctl -u ollama
reveals
WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1

  1. I do not manually compile ollama. I use the standard install script.
  2. The main README.md contains no mention of BLAS.

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.38

Figured it out. Ollama seems to think the model is too big to fit in VRAM (it isn't; it worked fine before the update). There is no useful communication about this to the user, and as mentioned above, digging into the log actually sends you in the wrong direction.

Hi @oldmanjk, sorry about this. May I ask which model you are running, and on which GPU?

I think I've got the same issue.
Running llama2:latest and llama3:latest on my GTX 1660 SUPER.
It worked before; now that I've updated to the latest Ollama, it seems to mostly use the CPU, which is way slower.

// Update:
It seems I had another process (a Python process) blocking my VRAM; I saw this with nvidia-smi.
I restarted my local PC and now it works with the GPU again.

Does anybody have an idea of the code we need to remove to stop it ignoring our num_gpu settings (again, sigh...)?

It's at the bottom of llm/memory.go:

        //if memoryRequiredPartial > memoryAvailable {
        //      slog.Debug("insufficient VRAM to load any model layers")
        //      return 0, 0, memoryRequiredTotal
        //}

Hi @oldmanjk, sorry about this. May I ask which model you are running, and on which GPU?

llama3 on a 1080 Ti

I think I've got the same issue. Running llama2:latest and llama3:latest on my GTX 1660 SUPER. It worked before; now that I've updated to the latest Ollama, it seems to mostly use the CPU, which is way slower.

// Update: It seems I had another process (a Python process) blocking my VRAM; I saw this with nvidia-smi. I restarted my local PC and now it works with the GPU again.

Definitely worth keeping an eye on your GPU memory (which I do; I keep a widget in view at all times), but that wasn't the issue for me.

Does anybody have an idea of the code we need to remove to stop it ignoring our num_gpu settings (again, sigh...)?

Also weird is how, if ollama thinks it can't fit the entire model in VRAM, it doesn't attempt to put any layers in VRAM. I actually like this behavior, though, because it makes it obvious something is wrong. Still, more communication to the user would be good.

Got the same issue here on openSUSE Tumbleweed. One thing I noticed: it uses the GPU for a moment, then it's gone...

Screencast_20240518_221101.webm

We've recently introduced ollama ps, which will help show how much of the model has loaded into VRAM.

We've fixed a few bugs recently around num_gpu handling in some of our prediction logic, but I'm not sure that addresses your comment @jukofyork. Can you explain what you're trying to do? The goal of our prediction algorithm is to set num_gpu automatically based on the available VRAM. There is a minimum requirement for models and if we can't even allocate that minimal amount, then we will fall back to CPU. If we can satisfy the minimal amount, but not load the full amount, we will partially load on the GPU. Are you trying to set a lower value to preserve more space on the GPU, or did we predict incorrectly and you're trying to specify more layers? If our prediction was right, and you still push higher, we'll likely OOM crash by trying to allocate too many layers on the GPU.

@oldmanjk can you clarify your problem? Perhaps ollama ps output and server log can help us understand what's going on.
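As an aside for readers following the discussion, here is a rough Go sketch of the three-way decision described above (full offload / partial offload / CPU fallback). This is not Ollama's actual implementation: the variable names mirror the llm/memory.go snippet quoted earlier, but the per-layer accounting, KV cache, and graph buffers are all glossed over.

    package main

    import "fmt"

    // estimateOffloadLayers sketches the decision described above: if even a
    // minimal partial load will not fit, fall back to CPU (0 layers); if the
    // whole model fits, offload every layer; otherwise offload a proportional
    // subset. The real code in llm/memory.go works layer by layer and accounts
    // for the KV cache and graph buffers; this is a simplification.
    func estimateOffloadLayers(memoryAvailable, memoryRequiredPartial, memoryRequiredTotal uint64, modelLayers int) int {
        if memoryRequiredPartial > memoryAvailable {
            return 0 // not even the minimum fits: CPU-only fallback
        }
        if memoryRequiredTotal <= memoryAvailable {
            return modelLayers // everything fits: full offload
        }
        // Partial offload: scale the layer count by the fraction of the model
        // that fits in the available VRAM.
        return int(float64(modelLayers) * float64(memoryAvailable) / float64(memoryRequiredTotal))
    }

    func main() {
        const GiB = 1 << 30
        // Hypothetical numbers: a ~41 GiB load on a 24 GiB GPU -> partial offload.
        fmt.Println(estimateOffloadLayers(24*GiB, 2*GiB, 41*GiB, 81))
        // Same model with almost no free VRAM -> CPU fallback (0 layers).
        fmt.Println(estimateOffloadLayers(1*GiB, 2*GiB, 41*GiB, 81))
    }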

I'm not at a terminal atm, but ollama refuses to load the same size models it used to, and that other backends will (like ooba with llama-cpp-python). Depending on the model/quant, I have to reduce num_gpu by a few layers compared to old ollama or ooba. When you've carefully optimized your quants like I have, this is the difference between fully offloaded and not. On a repurposed mining rig, this destroys performance. Also, if I don't change the Modelfile (which is a pain on a slow rig), ollama won't offload anything to the GPU.

Example walkthrough:

  1. Determine that ooba/llama-cpp-python can load my latest Meta-Llama-3-70B-Instruct-Q3_K_L quant with 8K context and 48 layers offloaded to GPU, and run it over 1K of context (which usually means it will run at full context without crashing).
  2. So now I'll go to the Modelfile, set num_ctx to 8192, num_gpu to 48, and num_thread to 32 (to get ollama to use 24 threads (sigh)), and import the gguf into ollama (see the Modelfile sketch below this comment).
  3. About a minute and 37.1 GB of wear and tear on my nvme later, success.
  4. Attempt to call model.
  5. ollama offloads nothing to the GPU, or to system RAM for that matter (outside of a few GiB). It's running it straight off the nvme? Why? I have over 90 GiB of RAM free.
    $ ollama ps
    NAME                                            ID              SIZE    PROCESSOR       UNTIL              
    Meta-Llama-3-70B-Instruct-Q3_K_L-8K:latest      b9345a582769    41 GB   100% CPU        4 minutes from now
    
    ollama_logs.txt attached - note the three locations where I've highlighted falsehoods claimed by ollama.
    ollama_logs.txt
  6. ollama rm Meta-Llama-3-70B-Instruct-Q3_K_L-8K (autocomplete would be nice)
  7. Go back to the Modelfile, set num_gpu to 47, and import the gguf into ollama again.
  8. Wait another minute or so (on this very-fast machine - on the old mining rig this can take upwards of ten minutes).
  9. Another 37.1 GB of wear and tear on my nvme later, success.
  10. Attempt to call model.
  11. Lucky! This time it works. Sometimes I have to repeat these steps a few times. 22.8 / 24.0 GiB used - that layer should have fit (in fact, we already know it does).

Edit - Now ollama is using all 32 threads (I probably want it to use 24) and basically 0% GPU. I have no idea what's going on here.
Edit - Removing num_thread produces 20% CPU utilization, whereas before I was seeing 10%. I don't know what's going on here either. Assuming we want all physical cores utilized, it should be 24/32, or 75%.
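For anyone reproducing the walkthrough above, a Modelfile along the lines described in step 2 would look roughly like this. The gguf filename is a placeholder, and the parameter values are the ones from the walkthrough, not general recommendations:

        FROM ./Meta-Llama-3-70B-Instruct-Q3_K_L.gguf
        PARAMETER num_ctx 8192
        PARAMETER num_gpu 48
        PARAMETER num_thread 32

It is then imported with something like ollama create Meta-Llama-3-70B-Instruct-Q3_K_L-8K -f Modelfile, which appears to be the step behind the ~37 GB of writes mentioned in steps 3 and 9.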

@oldmanjk the log you attached above seems to show a 2nd attempt where we fell back to the runners/cpu_avx2/ollama_llama_server CPU subprocess, after most likely unsuccessfully running on the GPU. Can you share a complete log so we can see what went wrong?

sudo systemctl stop ollama
OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log

requested.log

What is clear from both logs (as I already pointed out in the previous log) is that ollama is wrong about memory, both total and available. Ollama says my NVIDIA GeForce RTX 4090 (Founders Edition - as standard as it gets) has 23.6 GiB of total memory (obviously wrong) and 23.2 GiB of available memory (also wrong). The true numbers, according to nvidia-smi, are 24564 MiB (24.0 GiB, of course) total and 55 MiB used (24564 MiB - 55 MiB = 24509 MiB = 23.9 GiB available). So ollama thinks I have less memory than I do, and it refuses to load models it used to load just fine. That's why dropping a layer or two from the GPU makes it work again.

I think you have all the information you need from me. You just need to figure out why ollama is incorrectly detecting memory. If I had to guess, it's probably a classic case of wrong units or conversions thereof (GiB vs GB). You know, the thing they beat into our heads to be careful about in high school science class. The thing that caused the Challenger disaster.

Y'all need to slow down, be more careful, and put out good code. This would, paradoxically, give you more time, because you wouldn't have to spend so much time putting out fires. Again, all of this information was already available, so this was an unnecessary waste of my time too. I've attached the requested log anyway.
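For reference, the unit conversions being discussed here (using 1 GiB = 1024 MiB = 2^30 bytes and 1 GB = 10^9 bytes):

$$
\begin{aligned}
24564\ \text{MiB} &= \tfrac{24564}{1024}\ \text{GiB} \approx 23.99\ \text{GiB} \approx 25.76\ \text{GB} \\
23.6\ \text{GiB} &\approx 25.34\ \text{GB} \\
24\ \text{GB} &= \tfrac{24 \times 10^{9}}{2^{30}}\ \text{GiB} \approx 22.35\ \text{GiB}
\end{aligned}
$$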

Edit - I'm no software dev, but...maybe start here: #4328
If I'm right that that's the problem (that a dev arbitrarily decided to shave a layer's worth of space off as a "buffer", breaking the existing workflows of countless users, with no one notifying the user base or even all the other devs, causing hours of wasted time and confusion)...well, that's pretty bone-headed. The obvious typo in the original comment (the kind one would catch by reviewing one's pull request even once) illustrates my point (about slowing down) pretty spectacularly. Hell, a spell checker would have caught that. If I sound frustrated, it's because I am.

I'm not affiliated with this project by any means; I'm just a peasant who happens to be facing this issue as well, and I appreciate your diagnostics so far. I'm also using a Pascal-based GPU, with no luck so far.

That said, and while I understand your frustration as I'm also affected by this issue, there's no need to be snarky with the contributors. This is open source and no one is obligated to provide free support; most of us do it for passion. Next time, avoid expressing your frustration like that towards other developers who owe you nothing; it will hurt more than you think. As a more practical and constructive criticism, you can point out what you think is the cause of the issue and ask how you can help address it, perhaps even patching and recompiling if you know how to.

I appreciate your diagnostics so far

Thank you and you're welcome.

I understand your frustration as I'm also affected by this issue

Then you don't understand my frustration.

no one is obligated to provide free support

Straw man.

avoid expressing your frustration like that

Point considered and rejected.

developers who owe you nothing

Straw man.

it will hurt more than you think

You can't know this. If it would hurt you, that's a you problem and I would suggest recalling the ancient wisdom of "sticks and stones..."

As a more practical and constructive criticism, you can point out what you think is the cause of the issue

Did you actually read what I wrote?

and ask how you can help address it,

If the devs need help, they can ask. As they've been doing. And as I've been responding to their requests. You haven't actually read this thread, have you? All I've been doing is helping. You just think I'm mean. I prioritize actually helping over what people think about me. Why didn't you ask how you can help address it, since you think that's valuable advice?

perhaps even patching and recompiling if you know how to

I don't know how to, but I'd learn if they asked. That would be consistent with my past behavior. At this point, I'm not even sure why I'm wasting time responding to you. You suggest I help, when that's what I'm doing here. Yeah, I'm done. Peace.

Literally like 20 minutes later: I'm an idiot! On Arch Linux, install ollama-cuda. Why it took me hours to find that is yet another bit of evidence I probably shouldn't be given a keyboard! Hahaha!


Hi! Here is my equivalent issue and log file. Hope this helps!!

time=2024-07-21T18:55:31.136-07:00 level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-07-21T18:55:31.136-07:00 level=DEBUG source=gpu.go:91 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-07-21T18:55:31.136-07:00 level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcuda.so*
time=2024-07-21T18:55:31.136-07:00 level=DEBUG source=gpu.go:487 msg="gpu library search" globs="[<<<HOME>>/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-07-21T18:55:31.146-07:00 level=DEBUG source=gpu.go:521 msg="discovered GPU libraries" paths="[/usr/lib/libcuda.so.555.58.02 /usr/lib64/libcuda.so.555.58.02]"
CUDA driver version: 12.5
time=2024-07-21T18:55:31.254-07:00 level=DEBUG source=gpu.go:124 msg="detected GPUs" count=1 library=/usr/lib/libcuda.so.555.58.02
[GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8] CUDA totalMem 7788 mb
[GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8] CUDA freeMem 7369 mb
[GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8] Compute Capability 7.5
time=2024-07-21T18:55:31.457-07:00 level=DEBUG source=amd_linux.go:356 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
time=2024-07-21T18:55:31.457-07:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8 library=cuda compute=7.5 driver=12.5 name="NVIDIA GeForce RTX 2070 SUPER" total="7.6 GiB" available="7.2 GiB"

  .  .  .

time=2024-07-21T18:55:37.564-07:00 level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=<<<HOME>>/.ollama/models/blobs/sha256-dd0c6f2ea876e4c433325df3398386f24e00d321abf6cec197c1bc1fcf1e0025 gpu=GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8 parallel=4 available=7727087616 required="2.6 GiB"
time=2024-07-21T18:55:37.564-07:00 level=DEBUG source=server.go:100 msg="system memory" total="31.3 GiB" free="29.4 GiB" free_swap="0 B"
time=2024-07-21T18:55:37.564-07:00 level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.2 GiB]"
time=2024-07-21T18:55:37.564-07:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=19 layers.offload=19 layers.split="" memory.available="[7.2 GiB]" memory.required.full="2.6 GiB" memory.required.partial="2.6 GiB" memory.required.kv="144.0 MiB" memory.required.allocations="[2.6 GiB]" memory.weights.total="1.2 GiB" memory.weights.repeating="797.2 MiB" memory.weights.nonrepeating="410.2 MiB" memory.graph.full="504.0 MiB" memory.graph.partial="914.2 MiB"

  .  .  .

time=2024-07-21T18:55:37.565-07:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama485515592/runners/cpu_avx2/ollama_llama_server --model <<<HOME>>/.ollama/models/blobs/sha256-dd0c6f2ea876e4c433325df3398386f24e00d321abf6cec197c1bc1fcf1e0025 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 19 --verbose --parallel 4 --port 34295"
time=2024-07-21T18:55:37.565-07:00 level=DEBUG source=server.go:398 msg=subprocess environment="[CUDA_PATH=/opt/cuda PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/opt/cuda/bin:/opt/cuda/nsight_compute:/opt/cuda/nsight_systems/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl LD_LIBRARY_PATH=/tmp/ollama485515592/runners/cpu_avx2:/tmp/ollama485515592/runners]"
time=2024-07-21T18:55:37.565-07:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-21T18:55:37.565-07:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-21T18:55:37.565-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="126402163820352" timestamp=1721613337
INFO [main] build info | build=3337 commit="a8db2a9ce" tid="126402163820352" timestamp=1721613337
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="126402163820352" timestamp=1721613337 total_threads=6
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="6" port="34295" tid="126402163820352" timestamp=1721613337

ollama-cleaner.log

I switched to llama.cpp. It's better.

Edit - I'm evaluating mistral.rs now. Excellent dev.

We've fixed quite a few prediction bugs since 0.1.38, so I'm going to close this one out. If you're still hitting OOMs on 0.3.4, please share what model you were trying to load and the server log, and I'll reopen.