intel / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, DeepSeek, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, DeepSpeed, Axolotl, etc.

Repository from GitHub: https://github.com/intel/ipex-llm

Unable to fully load model into VRAM using the Ollama portable zip (GPU)

dttprofessor opened this issue

System: U265K (iGPU off) + 48 GB RAM + B580 (12 GB VRAM)

deepseek-r1:14b (Q4):
The B580's video memory is enough to hold the deepseek-r1:14b (Q4) model, but a segmentation fault occurs: less than 7 GB is loaded into dedicated VRAM, and the rest is placed in shared GPU memory.

deepseek-r1:32b (Q4):
12 GB of the model is loaded into dedicated GPU memory and the remaining 8 GB into shared GPU memory. System RAM is barely used, and the CPU does not take part in inference.

Could you check your GPU's VRAM usage before loading the model?
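For example, one way to see how much of an already-loaded model actually ended up in VRAM versus shared/CPU memory is Ollama's `/api/ps` endpoint, which reports a total `size` and a `size_vram` per loaded model. A minimal sketch, assuming the server runs at the default address `http://localhost:11434`:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama address; adjust if needed

# /api/ps lists currently loaded models with their total size and the
# portion reported as resident in GPU memory (size_vram).
with urllib.request.urlopen(f"{OLLAMA_URL}/api/ps") as resp:
    data = json.load(resp)

for model in data.get("models", []):
    total = model["size"]
    in_vram = model["size_vram"]
    print(f"{model['name']}: {in_vram / 2**30:.1f} GiB of "
          f"{total / 2**30:.1f} GiB in GPU memory "
          f"({100 * in_vram / max(total, 1):.0f}%)")
```

The same split is also shown by `ollama ps` on the command line.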

set OLLAMA_NUM_GPU=999

set no_proxy=localhost,127.0.0.1

set ZES_ENABLE_SYSMAN=1

set SYCL_CACHE_PERSISTENT=1

set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

set OLLAMA_KEEP_ALIVE=-1

set OLLAMA_NUM_PARALLEL=1

set OLLAMA_PARAMETER num_ctx 16384

set OLLAMA_PARAMETER num_predict 8192

set PARAMETER num_ctx 16384

set PARAMETER num_predict 8192
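A note on the last four lines: `OLLAMA_PARAMETER` does not appear to be an Ollama environment variable, and `PARAMETER num_ctx` / `PARAMETER num_predict` are Modelfile directives rather than shell variables, so those `set` commands likely have no effect. These options can instead be passed per request through the API (or baked into a derived model via a Modelfile and `ollama create`). A hedged sketch, assuming the deepseek-r1:14b tag from this issue and the default server address:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama address; adjust if needed

# Per-request options replace the invalid `set PARAMETER ...` lines above:
# num_ctx/num_predict control context and generation length; num_gpu asks
# Ollama to offload as many layers as possible to the GPU.
payload = {
    "model": "deepseek-r1:14b",   # model tag from this issue
    "prompt": "Hello",
    "stream": False,
    "options": {
        "num_ctx": 16384,
        "num_predict": 8192,
        "num_gpu": 999,
    },
}

req = urllib.request.Request(
    f"{OLLAMA_URL}/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```

Also worth noting: a 16384-token context enlarges the KV cache, so the GPU memory actually needed is noticeably larger than the bare Q4 weights, which can by itself push part of the model into shared GPU memory.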