UHD Graphics 730: error running DeepSeek-R1-14B
pruidong opened this issue
./start-ollama.bat
.\ollama run deepseek-r1:14b
error log:
time=2025-03-24T21:28:12.267+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-03-24T21:28:12.267+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=6 efficiency=0 threads=12
time=2025-03-24T21:28:12.326+08:00 level=INFO source=server.go:104 msg="system memory" total="15.8 GiB" free="8.2 GiB" free_swap="7.2 GiB"
time=2025-03-24T21:28:12.328+08:00 level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=49 layers.offload=0 layers.split="" memory.available="[8.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="9.4 GiB" memory.required.partial="0 B" memory.required.kv="384.0 MiB" memory.required.allocations="[8.1 GiB]" memory.weights.total="7.7 GiB" memory.weights.repeating="7.1 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="307.0 MiB" memory.graph.partial="916.1 MiB"
time=2025-03-24T21:28:12.338+08:00 level=INFO source=server.go:392 msg="starting llama server" cmd="F:\\Soft\\ollama-ipex-llm-2.2.0\\ollama-lib.exe runner --model F:\\Soft\\Ollama\\Models\\blobs\\sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e --ctx-size 2048 --batch-size 512 --n-gpu-layers 999 --threads 6 --no-mmap --parallel 1 --port 58205"
time=2025-03-24T21:28:12.389+08:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-03-24T21:28:12.389+08:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2025-03-24T21:28:12.389+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server error"
ggml_sycl_init: found 1 SYCL devices:
time=2025-03-24T21:28:12.657+08:00 level=INFO source=runner.go:967 msg="starting go runner"
time=2025-03-24T21:28:12.663+08:00 level=INFO source=runner.go:968 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=6
time=2025-03-24T21:28:12.664+08:00 level=INFO source=runner.go:1026 msg="Server listening on 127.0.0.1:58205"
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL0 (Intel(R) UHD Graphics 730) - 7144 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 579 tensors from F:\Soft\Ollama\Models\blobs\sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 14B
llama_model_loader: - kv 3: general.basename str = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv 4: general.size_label str = 14B
llama_model_loader: - kv 5: qwen2.block_count u32 = 48
llama_model_loader: - kv 6: qwen2.context_length u32 = 131072
llama_model_loader: - kv 7: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 8: qwen2.feed_forward_length u32 = 13824
llama_model_loader: - kv 9: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 10: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 11: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 12: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 13: general.file_type u32 = 15
llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 15: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151646
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type q4_K: 289 tensors
llama_model_loader: - type q6_K: 49 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 22
time=2025-03-24T21:28:12.892+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_layer = 48
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 5
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: f_attn_scale = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 14B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 14.77 B
llm_load_print_meta: model size = 8.37 GiB (4.87 BPW)
llm_load_print_meta: general.name = DeepSeek R1 Distill Qwen 14B
llm_load_print_meta: BOS token = 151646 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors: SYCL0 model buffer size = 8148.38 MiB
llm_load_tensors: CPU model buffer size = 417.66 MiB
Native API failed. Native API returns: 39 (UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY)
Exception caught at file:D:\actions-runner\release-cpp-oneapi_2024_2\_work\llm.cpp\llm.cpp\ollama-llama-cpp\ggml\src\ggml-sycl\ggml-sycl.cpp, line:345, func:operator()
SYCL error: CHECK_TRY_ERROR((*stream).memcpy((char *)tensor->data + offset, host_buf, size) .wait()): Meet error in this line code!
in function ggml_backend_sycl_buffer_set_tensor at D:\actions-runner\release-cpp-oneapi_2024_2\_work\llm.cpp\llm.cpp\ollama-llama-cpp\ggml\src\ggml-sycl\ggml-sycl.cpp:345
D:\actions-runner\release-cpp-oneapi_2024_2\_work\llm.cpp\llm.cpp\ollama-llama-cpp\ggml\src\ggml-sycl\..\ggml-sycl\common.hpp:107: SYCL error
time=2025-03-24T21:28:14.094+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server not responding"
time=2025-03-24T21:28:14.535+08:00 level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server error"
time=2025-03-24T21:28:14.785+08:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error:CHECK_TRY_ERROR((*stream).memcpy((char *)tensor->data + offset, host_buf, size) .wait()): Meet error in this line code!\r\n in function ggml_backend_sycl_buffer_set_tensor at D:\\actions-runner\\release-cpp-oneapi_2024_2\\_work\\llm.cpp\\llm.cpp\\ollama-llama-cpp\\ggml\\src\\ggml-sycl\\ggml-sycl.cpp:345\r\nD:\\actions-runner\\release-cpp-oneapi_2024_2\\_work\\llm.cpp\\llm.cpp\\ollama-llama-cpp\\ggml\\src\\ggml-sycl\\..\\ggml-sycl\\common.hpp:107: SYCL error"
Environment variables (including OLLAMA_GPU_MEMORY; a sketch of setting them follows the list):
OLLAMA_GPU_MEMORY=6144
OLLAMA_NUM_GPU=1
ZES_ENABLE_SYSMAN=1
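For reference, a minimal sketch of setting those variables in a Windows command prompt before launching the portable build. The names and values are copied from the list above; whether OLLAMA_GPU_MEMORY is actually read by this build is not verified here, and ZES_ENABLE_SYSMAN=1 only addresses the free-memory warning seen in the log.

rem Sketch: set the variables in the same console that will launch Ollama.
rem OLLAMA_GPU_MEMORY is taken from the reporter's setup; its effect is assumed, not verified.
set OLLAMA_GPU_MEMORY=6144
set OLLAMA_NUM_GPU=1
rem ZES_ENABLE_SYSMAN=1 is what the get_memory_info warning in the log asks for.
set ZES_ENABLE_SYSMAN=1
.\start-ollama.bat
.\ollama run deepseek-r1:14b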
Without ipex-llm, Ollama can run DeepSeek-R1-14B normally, but without using the GPU.
How can I configure Ollama to use the GPU for inference with deepseek-r1:14b? Thank you.
(UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY) in your log means out of memory. A pure-CPU run can use all of your system memory, but the iGPU can only share half of the total memory.
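For reference, the numbers already in the log show the shortfall (rounded; taken from the "offload to device", "SYCL0 ... free" and "model buffer" lines above):

SYCL0 reported free memory:         7144 MiB   (the iGPU's share of the 15.8 GiB system RAM)
SYCL0 model buffer (49/49 layers):  8148 MiB
KV cache (ctx 2048):                 384 MiB
compute graph:                       307 MiB
8148 + 384 + 307 ≈ 8839 MiB needed  >  7144 MiB available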
Thank you, I have identified the issue.
Switching to DeepSeek-R1:7B, it runs normally.
With pure CPU inference, DeepSeek-R1:14B also runs smoothly.
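A minimal sketch of the two workarounds mentioned above, assuming the same portable build and scripts. Whether setting OLLAMA_NUM_GPU=0 forces a CPU-only run in this build is an assumption, not something verified here; the stock (non-ipex-llm) Ollama build is the reporter's confirmed CPU path.

rem Option 1: the 7B model fits within the iGPU's ~7 GiB share.
.\ollama run deepseek-r1:7b

rem Option 2: run 14B on CPU only. Use the stock Ollama build, or, if this build
rem treats OLLAMA_NUM_GPU as the number of offloaded layers, set it to 0 first (assumption).
set OLLAMA_NUM_GPU=0
.\ollama run deepseek-r1:14b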
My computer has an integrated graphics card. If I purchase a dedicated Intel graphics card with more memory, can I run larger models? For example, could DeepSeek-R1:32B or QwQ-32B run with 48 GB of video memory?
The computer configuration is as follows:
Yes, you can run larger models with a dedicated GPU, but there are no consumer options with 48 GB of memory.


