Unable to use GLM model
RonkyTang opened this issue · comments
Describe the bug
An error occurs when using the following GLM models:
https://www.modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat-gguf
https://www.modelscope.cn/models/ZhipuAI/glm-edge-v-2b-gguf
Error messages:
llama runner process has terminated: error loading model: missing tensor 'blk.0.attn_qkv.weight'
llama_load_model_from_file: failed to load model
llama_model_load: error loading model: missing tensor 'blk.0.attn_qkv.weight'
llama_load_model_from_file: failed to load model
panic: unable to load model: /root/.ollama/models/blobs/sha256-1d4816cb2da5ac2a5acfa7315049ac9826d52842df81ac567de64755986949fa
goroutine 20 [running]:
ollama/llama/runner.(*Server).loadModel(0xc0004b2120, {0x3e7, 0x0, 0x0, 0x0, {0x0, 0x0, 0x0}, 0xc000502dd0, 0x0}, ...)
ollama/llama/runner/runner.go:861 +0x4ee
created by ollama/llama/runner.Execute in goroutine 1
ollama/llama/runner/runner.go:1001 +0xd0d
time=2025-03-26T11:22:38.876+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated:
error loading model: missing tensor 'blk.0.attn_qkv.weight'"
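For reproduction context, a local GGUF such as the ones linked above is typically registered with Ollama through a Modelfile; the file name and model tag below are illustrative, not necessarily the exact ones used here:

cat > Modelfile <<'EOF'
FROM ./glm-edge-1.5b-chat-Q4_K_M.gguf
EOF
ollama create glm-edge-1.5b-chat -f Modelfile
ollama run glm-edge-1.5b-chat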
Hi @RonkyTang, we are working on upgrading ipex-llm Ollama to a new version; these two GLM models should be supported then.
Thanks!
Hi @sgwhat, could you please share the schedule for the release? Thanks!
I will release v0.6.x support next week.
Two issues were identified when using the glm-edge-v-2b-gguf (https://www.modelscope.cn/models/ZhipuAI/glm-edge-v-2b-gguf) model:
- Long reasoning time
- The returned content is entirely incorrect
- With the official version of Ollama, everything works normally
Hi @RonkyTang, I have found the cause, and it will be fixed in tomorrow's version.
Thanks!
Hi @sgwhat, once your fix is ready, please drop us a message so we can give it a try, thanks! cc @RonkyTang
Hi @RonkyTang, I am still working on getting this model's CLIP part to run on the SYCL backend. I will get back to you in a few days once this issue is fixed.
Hi @RonkyTang, we have released the new version of Ollama at https://github.com/intel/ipex-llm/releases/tag/v2.3.0-nightly. We have optimized the CLIP model to run on the GPU on Windows.
Hi @sgwhat, thank you for your reply. But there is still a problem: loading multimodal models takes a few minutes.
Hi @RonkyTang, it seems that on Ubuntu, CLIP is still forced to run on the CPU (it works well with great performance on Windows). This has been fixed and I will release the fixed version tomorrow.
Hi @RonkyTang, we have released the optimized version on Ubuntu, which can run the CLIP model on the GPU. You may install it via:
pip install --pre --upgrade ipex-llm[cpp]
Hi @sgwhat, so you mean we need to install an ipex-llm env for the runtime device?
Yes, in the conda env. You may refer to this installation guide.
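For reference, here is a minimal sketch of that conda-based setup, assuming the standard ipex-llm[cpp] flow with its init-ollama helper; the env name and Python version are just examples:

conda create -n llm-cpp python=3.11
conda activate llm-cpp
pip install --pre --upgrade ipex-llm[cpp]
mkdir ollama-bin && cd ollama-bin
init-ollama      # creates symlinks to the ipex-llm Ollama binary in this directory
./ollama serve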
Hi @sgwhat, the preview version has a problem: we can't use the iGPU:
but the release version can be used:

This is expected behavior — Ollama does not utilize the iGPU until a model is loaded, at which point you will see VRAM usage increase. As for the confusing log message, I will remove it later. @RonkyTang
So, do you mean the preview version does use the iGPU?
Yes, you may load a model to check.
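For example, one way to verify this (the model name is illustrative; ollama ps reports whether the loaded model is placed on CPU or GPU):

ollama run glm-edge-1.5b-chat "hello"    # forces the model to load
ollama ps                                # PROCESSOR column shows CPU vs. GPU placement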
Hi @sgwhat, how can we make something like the Ollama portable package?
I copied all the libraries that the ollama binary depends on into the ollama-bin directory and set the environment variables, but the model cannot be used properly.
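For context, the manual bundling described above would roughly look like the following on Linux; the directory layout is illustrative, and the ipex-llm build may additionally need the oneAPI runtime (e.g. sourcing setvars.sh) on top of this:

mkdir -p ollama-bin/libs
cp ./ollama ollama-bin/
ldd ./ollama | awk '/=> \// {print $3}' | xargs -I{} cp {} ollama-bin/libs/   # copy linked shared libraries
export LD_LIBRARY_PATH=$PWD/ollama-bin/libs:$LD_LIBRARY_PATH                  # point the loader at the bundled libs
./ollama-bin/ollama serve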
Hi @sgwhat, we have also found another problem:
when using the ipex-llm Ollama version, there is continuous memory usage of 17% (model is glm 1.5b):

but with the public Ollama version, memory usage is only 4~5% (model is also glm 1.5b):

Hi @RonkyTang, we have released a new Ollama version: https://www.modelscope.cn/models/Intel/ollama .
Hi @sgwhat, thank you for the update. But it still has memory issues.
Hi @sgwhat, any update on the portable-package question and the memory usage issue above?
@sgwhat any comment on this issue?
@RonkyTang could you please check which Ollama process uses more memory? You can use "top" and then press "M" to sort processes by memory usage.
At the same time, you could run "free -h" to check whether the memory is allocated to "buff/cache".
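For example (with procps-ng top, passing -o %MEM starts it already sorted by memory, equivalent to pressing M):

top -o %MEM    # processes sorted by resident memory
free -h        # shows real usage vs. memory sitting in buff/cache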
Hi @RonkyTang , I apologize for the late reply. The memory usage depends on many factors, including different values of num_parallel and num_ctx. You can try adjusting these parameters to check.
Additionally, we’ve just released the latest version of Ollama, you may try running this version and share the actual memory usage with me.
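For reference, with upstream Ollama these knobs are usually set like this; the values and model name below are only examples, not recommendations:

export OLLAMA_NUM_PARALLEL=1    # fewer parallel request slots means fewer KV caches held in memory
./ollama serve

cat > Modelfile <<'EOF'
FROM glm-edge-1.5b-chat
PARAMETER num_ctx 2048
EOF
ollama create glm-edge-small-ctx -f Modelfile   # smaller context window reduces memory per request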
Hi @sgwhat, the problem is fixed in the new version. Thanks for your help.
And please also take a look at our other issue:
#13192


