ollama / ollama

Get up and running with Llama 3, Mistral, Gemma, and other large language models.

Home Page: https://ollama.com


Ollama speed dropped after setting OLLAMA_NUM_PARALLEL

hugefrog opened this issue

What is the issue?

After setting OLLAMA_NUM_PARALLEL in Ollama 0.1.38, single-user access speed has dropped by half, and GPU utilization is only about 50% (see the measurement sketch after the version details below).

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.1.38
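
To quantify the slowdown described above, a minimal sketch like the following can time a single request against the local Ollama REST API and derive tokens per second from the eval_count and eval_duration fields of the response. The endpoint and response fields are the standard Ollama generate API; the model name and prompt are only placeholders.

```python
# Minimal single-request benchmark against a local Ollama server.
# Assumes the default endpoint (http://localhost:11434) and an already-pulled
# model; swap in whichever model you are testing.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",  # placeholder model name
        "prompt": "Explain what a KV cache is in one paragraph.",
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()
stats = resp.json()

# eval_count = generated tokens, eval_duration = generation time in nanoseconds.
tok_per_s = stats["eval_count"] / (stats["eval_duration"] / 1e9)
print(f"{stats['eval_count']} tokens at {tok_per_s:.1f} tok/s")
```

Running this once with OLLAMA_NUM_PARALLEL unset and once with it set makes the single-user regression easy to compare.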

Same problem here :(
But I'm running it on Debian 11 with a Ryzen 3900X CPU and no GPU.

Okay, I did a little research on my case, and it doesn't seem to be related to this bug after all. I found a similar issue with explanations here. Setting the num_thread parameter in the model file did help utilize 100% of the CPU in my case, but it didn't really improve performance. Sorry for misleading you. I have only been using Ollama for a short time, and until recently I had a server without HT. Your case is probably different.
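
As an alternative to editing the model file, the same parameter can be passed per request through the API's options field. A minimal sketch, assuming a local server and a placeholder model name:

```python
# Override num_thread for a single request instead of baking it into a
# Modelfile. Assumes a local Ollama server; the model name is a placeholder.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",              # placeholder model name
        "prompt": "Hello",
        "stream": False,
        "options": {"num_thread": 12},  # match physical cores, not HT threads
    },
    timeout=600,
)
print(resp.json().get("eval_count"), "tokens generated")
```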

@hugefrog can you share some more details? What does ollama ps show with/without the parallel setting, and what did you set it to? We have to multiply the context size by the parallel count when loading into the GPU, so if you're loading a model that only just fits without parallelism, enabling it might be pushing layers off the GPU and into system memory, which could explain the slowdown. Our goal is to auto-select parallelism in the future based on available VRAM so we can avoid overflowing to the CPU.
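
For intuition, here is a rough back-of-the-envelope sketch of how the KV cache grows with the parallel setting; the model dimensions below are illustrative assumptions, not the exact values for any particular model.

```python
# Rough estimate of KV cache size: it is allocated for num_ctx * num_parallel
# tokens, so raising OLLAMA_NUM_PARALLEL multiplies this part of the VRAM cost.
def kv_cache_bytes(num_ctx, num_parallel, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    tokens = num_ctx * num_parallel
    # 2x for keys and values, one cache entry per layer per token.
    return 2 * n_layers * tokens * n_kv_heads * head_dim * bytes_per_elem

# Illustrative 34B-class model: 60 layers, 8 KV heads, head_dim 128, fp16 cache.
for parallel in (1, 4):
    gib = kv_cache_bytes(2048, parallel, 60, 8, 128) / 2**30
    print(f"num_parallel={parallel}: ~{gib:.2f} GiB of KV cache")
```

The weights themselves are unchanged, but the extra cache can be what pushes a model that barely fit past the available VRAM.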

If you're still seeing performance problems, please make sure to upgrade to the latest version and share the ollama ps output so we can evaluate.

Thank you for your response. I have updated Ollama to version 0.1.141 and ran some tests. I found that after setting OLLAMA_NUM_PARALLEL, the memory footprint of the yi:34b-chat-v1.5-q4_K_M model increased from 22GB to 25GB, which exceeds the memory capacity of my Nvidia 3090 and results in the drop in speed.
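
A quick fit check with those reported numbers shows why the spill happens: an RTX 3090 has 24GB of VRAM, so the load no longer fits once parallel slots are added.

```python
# Fit check using the reported numbers: 22GB without parallel, 25GB with it,
# against the 24GB of VRAM on an RTX 3090.
VRAM_GB = 24
for label, load_gb in (("without OLLAMA_NUM_PARALLEL", 22), ("with OLLAMA_NUM_PARALLEL", 25)):
    verdict = "fits on GPU" if load_gb <= VRAM_GB else "overflows to system RAM"
    print(f"{label}: {load_gb}GB -> {verdict}")
```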

@hugefrog that sounds like expected behavior with the current architecture. In an upcoming release, if no parallel setting is defined, we'll auto-detect available VRAM and choose a parallel level that keeps the model fully in VRAM.