ollama / ollama

Get up and running with Llama 3, Mistral, Gemma, and other large language models.

Home Page: https://ollama.com


Ollama speed dropped after setting OLLAMA_NUM_PARALLEL

hugefrog opened this issue

What is the issue?

After setting OLLAMA_NUM_PARALLEL in Ollama 0.1.38, single-user access speed has dropped by half, and GPU utilization is only about 50% (see the measurement sketch after the version details below).

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.1.38
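
To quantify the slowdown described above, a minimal sketch like the following can time a single request against the local Ollama REST API and derive tokens per second from the eval_count and eval_duration fields of the response. The endpoint and response fields are the standard Ollama generate API; the model name and prompt are only placeholders.

```python
# Minimal single-request benchmark against a local Ollama server.
# Assumes the default endpoint (http://localhost:11434) and an already-pulled
# model; swap in whichever model you are testing.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",  # placeholder model name
        "prompt": "Explain what a KV cache is in one paragraph.",
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()
stats = resp.json()

# eval_count = generated tokens, eval_duration = generation time in nanoseconds.
tok_per_s = stats["eval_count"] / (stats["eval_duration"] / 1e9)
print(f"{stats['eval_count']} tokens at {tok_per_s:.1f} tok/s")
```

Running this once with OLLAMA_NUM_PARALLEL unset and once with it set makes the single-user regression easy to compare.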

Same problem here :(
But I'm running it on Debian 11 with a Ryzen 3900X CPU and no GPU.

Okay, I did a little research on my case, and it doesn't seem to be related to this bug after all. I found a similar issue with explanations here. Setting the num_thread parameter in the model file did help utilize 100% of the CPU in my case, but it didn't really improve performance. Sorry for misleading you. I have only been using Ollama for a short time, and until recently I had a server without HT. Your case is probably different.
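
As an alternative to editing the model file, the same parameter can be passed per request through the API's options field. A minimal sketch, assuming a local server and a placeholder model name:

```python
# Override num_thread for a single request instead of baking it into a
# Modelfile. Assumes a local Ollama server; the model name is a placeholder.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",              # placeholder model name
        "prompt": "Hello",
        "stream": False,
        "options": {"num_thread": 12},  # match physical cores, not HT threads
    },
    timeout=600,
)
print(resp.json().get("eval_count"), "tokens generated")
```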

@hugefrog can you share some more details? What does ollama ps show with/without the parallel setting, and what did you set it to? We have to multiply the context size by the parallel count when loading into the GPU, so if you're loading a model that only just fits without parallelism, enabling it might be pushing layers off the GPU and into system memory, which could explain the slowdown. Our goal is to auto-select parallelism in the future based on available VRAM so we can avoid overflowing to the CPU.
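
For intuition, here is a rough back-of-the-envelope sketch of how the KV cache grows with the parallel setting; the model dimensions below are illustrative assumptions, not the exact values for any particular model.

```python
# Rough estimate of KV cache size: it is allocated for num_ctx * num_parallel
# tokens, so raising OLLAMA_NUM_PARALLEL multiplies this part of the VRAM cost.
def kv_cache_bytes(num_ctx, num_parallel, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    tokens = num_ctx * num_parallel
    # 2x for keys and values, one cache entry per layer per token.
    return 2 * n_layers * tokens * n_kv_heads * head_dim * bytes_per_elem

# Illustrative 34B-class model: 60 layers, 8 KV heads, head_dim 128, fp16 cache.
for parallel in (1, 4):
    gib = kv_cache_bytes(2048, parallel, 60, 8, 128) / 2**30
    print(f"num_parallel={parallel}: ~{gib:.2f} GiB of KV cache")
```

The weights themselves are unchanged, but the extra cache can be what pushes a model that barely fit past the available VRAM.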

If you're still seeing performance problems, please make sure to upgrade to the latest version and share the ollama ps output so we can evaluate.

Thank you for your response. I have updated Ollama to version 0.1.141 and ran some tests. I found that after setting OLLAMA_NUM_PARALLEL, the memory footprint of the yi:34b-chat-v1.5-q4_K_M model increased from 22GB to 25GB, which exceeds the memory capacity of my Nvidia 3090 and results in the drop in speed.
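
A quick fit check with those reported numbers shows why the spill happens: an RTX 3090 has 24GB of VRAM, so the load no longer fits once parallel slots are added.

```python
# Fit check using the reported numbers: 22GB without parallel, 25GB with it,
# against the 24GB of VRAM on an RTX 3090.
VRAM_GB = 24
for label, load_gb in (("without OLLAMA_NUM_PARALLEL", 22), ("with OLLAMA_NUM_PARALLEL", 25)):
    verdict = "fits on GPU" if load_gb <= VRAM_GB else "overflows to system RAM"
    print(f"{label}: {load_gb}GB -> {verdict}")
```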

@hugefrog that sounds like expected behavior with the current architecture. In an upcoming release, if no parallel setting is defined, we'll auto-detect available VRAM and choose a parallel level that keeps the model fully in VRAM.