Unable to use GLM model
RonkyTang opened this issue · comments
Describe the bug
An error occurs when using the following GLM models:
https://www.modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat-gguf
https://www.modelscope.cn/models/ZhipuAI/glm-edge-v-2b-gguf
Error messages:
llama runner process has terminated: error loading model: missing tensor 'blk.0.attn_qkv.weight'
llama_load_model_from_file: failed to load model
llama_model_load: error loading model: missing tensor 'blk.0.attn_qkv.weight'
llama_load_model_from_file: failed to load model
panic: unable to load model: /root/.ollama/models/blobs/sha256-1d4816cb2da5ac2a5acfa7315049ac9826d52842df81ac567de64755986949fa
goroutine 20 [running]:
ollama/llama/runner.(*Server).loadModel(0xc0004b2120, {0x3e7, 0x0, 0x0, 0x0, {0x0, 0x0, 0x0}, 0xc000502dd0, 0x0}, ...)
ollama/llama/runner/runner.go:861 +0x4ee
created by ollama/llama/runner.Execute in goroutine 1
ollama/llama/runner/runner.go:1001 +0xd0d
time=2025-03-26T11:22:38.876+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated:
error loading model: missing tensor 'blk.0.attn_qkv.weight'"
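For reproduction context, a local GGUF such as the ones linked above is typically registered with Ollama through a Modelfile; the file name and model tag below are illustrative, not necessarily the exact ones used here:

cat > Modelfile <<'EOF'
FROM ./glm-edge-1.5b-chat-Q4_K_M.gguf
EOF
ollama create glm-edge-1.5b-chat -f Modelfile
ollama run glm-edge-1.5b-chat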
Hi @RonkyTang, we are working on upgrading ipex-llm Ollama to a new version; these two GLM models should be supported then.
Thanks!
Hi @sgwhat, could you please share the schedule for the release? Thanks!
I will release v0.6.x support next week.
Two issues were identified when using the glm-edge-v-2b-gguf (https://www.modelscope.cn/models/ZhipuAI/glm-edge-v-2b-gguf) model:
- Long reasoning time
- The returned content is entirely incorrect
- With the official version of Ollama, everything works normally
Hi @RonkyTang, I have found the cause, and it will be fixed in tomorrow's version.
Thanks!
Hi @sgwhat, once your fix is ready, please drop us a message so we can give it a try, thanks! cc @RonkyTang
Hi @RonkyTang, I am still working on getting this model's CLIP part to run on the SYCL backend. I will get back to you in a few days once this issue is fixed.
Hi @RonkyTang, we have released the new version of Ollama at https://github.com/intel/ipex-llm/releases/tag/v2.3.0-nightly. We have optimized the CLIP model to run on the GPU on Windows.
Hi @sgwhat, thank you for your reply. But there is still a problem: loading multimodal models takes a few minutes.
Hi @RonkyTang, it seems that on Ubuntu, CLIP is still forced to run on the CPU (it works well with great performance on Windows). This has been fixed and I will release the fixed version tomorrow.
Hi @RonkyTang, we have released the optimized version on Ubuntu, which can run the CLIP model on the GPU. You may install it via:
pip install --pre --upgrade ipex-llm[cpp]
Hi @sgwhat, so you mean we need to install an ipex-llm env for the runtime device?
Yes, in the conda env. You may refer to this installation guide.
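For reference, here is a minimal sketch of that conda-based setup, assuming the standard ipex-llm[cpp] flow with its init-ollama helper; the env name and Python version are just examples:

conda create -n llm-cpp python=3.11
conda activate llm-cpp
pip install --pre --upgrade ipex-llm[cpp]
mkdir ollama-bin && cd ollama-bin
init-ollama      # creates symlinks to the ipex-llm Ollama binary in this directory
./ollama serve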
Hi @sgwhat, the preview version has a problem: we can't use the iGPU:
but the release version can be used:

This is expected behavior — Ollama does not utilize the iGPU until a model is loaded, at which point you will see VRAM usage increase. As for the confusing log message, I will remove it later. @RonkyTang
So, do you mean the preview version does use the iGPU?
Yes, you may load a model to check.
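For example, one way to verify this (the model name is illustrative; ollama ps reports whether the loaded model is placed on CPU or GPU):

ollama run glm-edge-1.5b-chat "hello"    # forces the model to load
ollama ps                                # PROCESSOR column shows CPU vs. GPU placement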
Hi @sgwhat, how can we make something like the Ollama portable package?
I copied all the libraries that the ollama binary depends on into the ollama-bin directory and set the environment variables, but the model cannot be used properly.
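For context, the manual bundling described above would roughly look like the following on Linux; the directory layout is illustrative, and the ipex-llm build may additionally need the oneAPI runtime (e.g. sourcing setvars.sh) on top of this:

mkdir -p ollama-bin/libs
cp ./ollama ollama-bin/
ldd ./ollama | awk '/=> \// {print $3}' | xargs -I{} cp {} ollama-bin/libs/   # copy linked shared libraries
export LD_LIBRARY_PATH=$PWD/ollama-bin/libs:$LD_LIBRARY_PATH                  # point the loader at the bundled libs
./ollama-bin/ollama serve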
Hi @sgwhat, we have also found another problem:
when using the ipex-llm Ollama version, there is continuous memory usage of 17% (model is glm 1.5b):

but with the public Ollama version, memory usage is only 4~5% (model is also glm 1.5b):

Hi @RonkyTang, we have released a new Ollama version: https://www.modelscope.cn/models/Intel/ollama .
Hi @sgwhat, thank you for the update. But it still has memory issues.
Hi @sgwhat, any update on the portable-package question and the memory usage issue above?
@sgwhat any comment on this issue?
@RonkyTang could you please check which Ollama process uses more memory? You can use "top" and then press "M" to sort processes by memory usage.
At the same time, you could run "free -h" to check whether the memory is allocated to "buff/cache".
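For example (with procps-ng top, passing -o %MEM starts it already sorted by memory, equivalent to pressing M):

top -o %MEM    # processes sorted by resident memory
free -h        # shows real usage vs. memory sitting in buff/cache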
Hi @RonkyTang , I apologize for the late reply. The memory usage depends on many factors, including different values of num_parallel and num_ctx. You can try adjusting these parameters to check.
Additionally, we’ve just released the latest version of Ollama, you may try running this version and share the actual memory usage with me.
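For reference, with upstream Ollama these knobs are usually set like this; the values and model name below are only examples, not recommendations:

export OLLAMA_NUM_PARALLEL=1    # fewer parallel request slots means fewer KV caches held in memory
./ollama serve

cat > Modelfile <<'EOF'
FROM glm-edge-1.5b-chat
PARAMETER num_ctx 2048
EOF
ollama create glm-edge-small-ctx -f Modelfile   # smaller context window reduces memory per request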
Hi @sgwhat, the problem is fixed in the new version. Thanks for your help.
And please also take a look at our other issue:
#13192


