intel / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, DeepSeek, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, DeepSpeed, Axolotl, etc.

Repository from GitHub: https://github.com/intel/ipex-llm

Error running DeepSeek-R1-14B on UHD Graphics 730

pruidong opened this issue

./start-ollama.bat

.\ollama run deepseek-r1:14b

error log:

time=2025-03-24T21:28:12.267+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-03-24T21:28:12.267+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=6 efficiency=0 threads=12
time=2025-03-24T21:28:12.326+08:00 level=INFO source=server.go:104 msg="system memory" total="15.8 GiB" free="8.2 GiB" free_swap="7.2 GiB"
time=2025-03-24T21:28:12.328+08:00 level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=49 layers.offload=0 layers.split="" memory.available="[8.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="9.4 GiB" memory.required.partial="0 B" memory.required.kv="384.0 MiB" memory.required.allocations="[8.1 GiB]" memory.weights.total="7.7 GiB" memory.weights.repeating="7.1 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="307.0 MiB" memory.graph.partial="916.1 MiB"
time=2025-03-24T21:28:12.338+08:00 level=INFO source=server.go:392 msg="starting llama server" cmd="F:\\Soft\\ollama-ipex-llm-2.2.0\\ollama-lib.exe runner --model F:\\Soft\\Ollama\\Models\\blobs\\sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e --ctx-size 2048 --batch-size 512 --n-gpu-layers 999 --threads 6 --no-mmap --parallel 1 --port 58205"
time=2025-03-24T21:28:12.389+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-03-24T21:28:12.389+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2025-03-24T21:28:12.389+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server error"
ggml_sycl_init: found 1 SYCL devices:
time=2025-03-24T21:28:12.657+08:00 level=INFO source=runner.go:967 msg="starting go runner"
time=2025-03-24T21:28:12.663+08:00 level=INFO source=runner.go:968 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-03-24T21:28:12.664+08:00 level=INFO source=runner.go:1026 msg="Server listening on 127.0.0.1:58205"
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL0 (Intel(R) UHD Graphics 730) - 7144 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 579 tensors from F:\Soft\Ollama\Models\blobs\sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 14B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 14B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 48
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 13824
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  13:                          general.file_type u32              = 15
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  241 tensors
llama_model_loader: - type q4_K:  289 tensors
llama_model_loader: - type q6_K:   49 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 22
time=2025-03-24T21:28:12.892+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 48
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 5
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: f_attn_scale     = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 14B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 14.77 B
llm_load_print_meta: model size       = 8.37 GiB (4.87 BPW)
llm_load_print_meta: general.name     = DeepSeek R1 Distill Qwen 14B
llm_load_print_meta: BOS token        = 151646 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors:        SYCL0 model buffer size =  8148.38 MiB
llm_load_tensors:          CPU model buffer size =   417.66 MiB
Native API failed. Native API returns: 39 (UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY)
Exception caught at file:D:\actions-runner\release-cpp-oneapi_2024_2\_work\llm.cpp\llm.cpp\ollama-llama-cpp\ggml\src\ggml-sycl\ggml-sycl.cpp, line:345, func:operator()
SYCL error: CHECK_TRY_ERROR((*stream).memcpy((char *)tensor->data + offset, host_buf, size) .wait()): Meet error in this line code!
  in function ggml_backend_sycl_buffer_set_tensor at D:\actions-runner\release-cpp-oneapi_2024_2\_work\llm.cpp\llm.cpp\ollama-llama-cpp\ggml\src\ggml-sycl\ggml-sycl.cpp:345
D:\actions-runner\release-cpp-oneapi_2024_2\_work\llm.cpp\llm.cpp\ollama-llama-cpp\ggml\src\ggml-sycl\..\ggml-sycl\common.hpp:107: SYCL error
time=2025-03-24T21:28:14.094+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server not responding"
time=2025-03-24T21:28:14.535+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server error"
time=2025-03-24T21:28:14.785+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error:CHECK_TRY_ERROR((*stream).memcpy((char *)tensor->data + offset, host_buf, size) .wait()): Meet error in this line code!\r\n  in function ggml_backend_sycl_buffer_set_tensor at D:\\actions-runner\\release-cpp-oneapi_2024_2\\_work\\llm.cpp\\llm.cpp\\ollama-llama-cpp\\ggml\\src\\ggml-sycl\\ggml-sycl.cpp:345\r\nD:\\actions-runner\\release-cpp-oneapi_2024_2\\_work\\llm.cpp\\llm.cpp\\ollama-llama-cpp\\ggml\\src\\ggml-sycl\\..\\ggml-sycl\\common.hpp:107: SYCL error"

Environment variables:

OLLAMA_GPU_MEMORY=6144
OLLAMA_NUM_GPU=1
ZES_ENABLE_SYSMAN=1
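
For reference, these variables are typically set in the same console session before launching the server. A minimal sketch for a Windows cmd session, using the values from this report (not tuned recommendations; in PowerShell the $env:NAME syntax would be used instead):

rem lets the runtime query actual free GPU memory (see the get_memory_info warning in the log)
set ZES_ENABLE_SYSMAN=1
rem values as set in this report
set OLLAMA_NUM_GPU=1
set OLLAMA_GPU_MEMORY=6144
start-ollama.bat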

Without using ipex-llm, I can run DeepSeek-R1-14B normally with Ollama, but it does not use the GPU.

How can I configure Ollama to use the GPU for inference (DeepSeek-R1:14b)? Thank you.

(UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY) in your log means out of device memory. Pure CPU inference can use all of your system memory, but the iGPU can only share half of the total memory.
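
Rough arithmetic from the numbers in the log above (a sketch, assuming the half-of-RAM cap on iGPU shared memory mentioned above):

    SYCL0 model buffer required  ≈ 8148 MiB
    device memory reported free  ≈ 7144 MiB  (out of 15.8 GiB system RAM)

So offloading all 49 layers cannot fit on the UHD Graphics 730, and the tensor upload fails with UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY.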

Thank you, I have identified this issue.

I can switch to DeepSeek-R1:7B and it runs normally.
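
For reference, the smaller distill can be run with the same command form used above; a minimal sketch (the deepseek-r1:7b tag follows the same Ollama library naming as the 14b tag):

.\ollama run deepseek-r1:7b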

With pure CPU inference, DeepSeek-R1:14B runs smoothly.

My computer has an integrated graphics card. If I purchase a new Intel graphics card with more memory, can I run inference with larger models? For example, can DeepSeek-R1:32B or QwQ-32B run with 48GB of video memory?

The computer configuration is as follows:

(Screenshots of the system configuration were attached here.)

Yes, you can run larger models with a dedicated GPU, but there are no consumer options with 48GB of memory.
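
As a back-of-the-envelope sizing sketch (an estimate, not an ipex-llm guarantee): weight memory ≈ parameter count × bits per weight / 8. Using the 4.87 BPW Q4_K_M figure from the log above, a 32B model needs roughly 32e9 × 4.87 / 8 ≈ 19.5 GB for the weights alone, before KV cache and runtime overhead, so the GPU (or combination of GPUs) should have comfortably more than 20 GB of device memory to offload every layer.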