ollama ps shows CPU use even though start-ollama.sh indicates the model is loaded into the GPU?
chbacher opened this issue
Hi everyone,
I am not quite sure whether this is really an issue or a misconfiguration on my side.
When firing up start-ollama.sh I get the log output pasted further below, which looks to me as if the model is loaded into the GPU (or is it not?).
However, when checking in a separate terminal with ollama ps, it shows:
./ollama ps
ggml_sycl_init: found 1 SYCL devices:
NAME ID SIZE PROCESSOR UNTIL
hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:latest 6ab7dc0572eb 8.1 GB 100% CPU 9 minutes from now
Do I misunderstand something here (or am I using the wrong parameters)? Thanks.
time=2025-03-27T16:09:29.379Z level=INFO source=server.go:104 msg="system memory" total="8.0 GiB" free="7.8 GiB" free_swap="0 B"
time=2025-03-27T16:09:29.379Z level=INFO source=memory.go:356 msg="offload to device" layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[3.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.8 GiB" memory.required.partial="0 B" memory.required.kv="2.0 GiB" memory.required.allocations="[3.0 GiB]" memory.weights.total="5.9 GiB" memory.weights.repeating="5.5 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
time=2025-03-27T16:09:29.379Z level=INFO source=server.go:392 msg="starting llama server" cmd="/root/ollama-ipex-llm-2.2.0b20250318-ubuntu/ollama-bin runner --model /root/.ollama/models/blobs/sha256-f8eba201522ab44b79bc54166126bfaf836111ff4cbf2d13c59c3b57da10573b --ctx-size 16384 --batch-size 512 --n-gpu-layers 999 --threads 4 --no-mmap --parallel 1 --port 39469"
time=2025-03-27T16:09:29.379Z level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-03-27T16:09:29.379Z level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2025-03-27T16:09:29.380Z level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server error"
ggml_sycl_init: found 1 SYCL devices:
time=2025-03-27T16:09:29.443Z level=INFO source=runner.go:967 msg="starting go runner"
time=2025-03-27T16:09:29.443Z level=INFO source=runner.go:968 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=4
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_load_model_from_file: using device SYCL0 (Intel(R) Graphics) - 14374 MiB free
time=2025-03-27T16:09:29.443Z level=INFO source=runner.go:1026 msg="Server listening on 127.0.0.1:39469"
llama_model_loader: loaded meta data with 32 key-value pairs and 292 tensors from /root/.ollama/models/blobs/sha256-f8eba201522ab44b79bc54166126bfaf836111ff4cbf2d13c59c3b57da10573b (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Llama 8B
llama_model_loader: - kv 3: general.organization str = Deepseek Ai
llama_model_loader: - kv 4: general.basename str = DeepSeek-R1-Distill-Llama
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: llama.block_count u32 = 32
llama_model_loader: - kv 7: llama.context_length u32 = 131072
llama_model_loader: - kv 8: llama.embedding_length u32 = 4096
llama_model_loader: - kv 9: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 10: llama.attention.head_count u32 = 32
llama_model_loader: - kv 11: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 13: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 14: llama.attention.key_length u32 = 128
llama_model_loader: - kv 15: llama.attention.value_length u32 = 128
llama_model_loader: - kv 16: llama.vocab_size u32 = 128256
llama_model_loader: - kv 17: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 18: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 19: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 20: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 21: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 22: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 23: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 24: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 25: tokenizer.ggml.padding_token_id u32 = 128004
llama_model_loader: - kv 26: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 27: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 28: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 29: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 15
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
time=2025-03-27T16:09:29.630Z level=INFO source=server.go:605 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: f_attn_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.58 GiB (4.89 BPW)
llm_load_print_meta: general.name = DeepSeek R1 Distill Llama 8B
llm_load_print_meta: BOS token = 128000 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 128001 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: PAD token = 128004 '<|finetune_right_pad_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128001 '<|end▁of▁sentence|>'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 model buffer size = 4403.49 MiB
llm_load_tensors: CPU model buffer size = 281.81 MiB
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: n_ctx_per_seq = 16384
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
Found 1 SYCL devices:
| | | | |Max | |Max |Global | |
| | | | |compute|Max work|sub |mem | |
|ID| Device Type| Name|Version|units |group |group|size | Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]| Intel Graphics| 12.4| 32| 512| 32| 15072M| 1.6.32224+14|
llama_kv_cache_init: SYCL0 KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.50 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 258.50 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 40.01 MiB
llama_new_context_with_model: graph nodes = 902
llama_new_context_with_model: graph splits = 2
time=2025-03-27T16:09:39.335Z level=WARN source=runner.go:892 msg="%s: warming up the model with an empty run - please wait ... " !BADKEY=loadModel
time=2025-03-27T16:09:39.415Z level=INFO source=server.go:610 msg="llama runner started in 10.04 seconds"
[GIN] 2025/03/27 - 16:09:58 | 200 | 28.931585073s | 127.0.0.1 | POST "/api/chat"
Select specific GPU(s) to run Ollama when multiple ones are available
If your machine has multiple Intel GPUs, Ollama will by default run on all of them.
To specify which Intel GPU(s) you would like Ollama to use, you can set the environment variable ONEAPI_DEVICE_SELECTOR before starting Ollama serve, as follows (if Ollama serve is already running, please make sure to stop it first):
Identify the ids (e.g. 0, 1, etc.) of your GPUs. You can find them in the logs of Ollama serve when loading any model.
For Windows users:
Open "Command Prompt", and navigate to the extracted folder by cd /d PATH\TO\EXTRACTED\FOLDER
In the "Command Prompt", set ONEAPI_DEVICE_SELECTOR to define the Intel GPU(s) you want to use, e.g. set ONEAPI_DEVICE_SELECTOR=level_zero:0 (on single Intel GPU), or set ONEAPI_DEVICE_SELECTOR=level_zero:0;level_zero:1 (on multiple Intel GPUs), in which 0, 1 should be changed to your desired GPU id
Start Ollama serve through start-ollama.bat
For Linux users:
In a terminal, navigate to the extracted folder by cd PATH/TO/EXTRACTED/FOLDER
Set ONEAPI_DEVICE_SELECTOR to define the Intel GPU(s) you want to use, e.g. export ONEAPI_DEVICE_SELECTOR=level_zero:0 (on single Intel GPU), or export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1" (on multiple Intel GPUs), in which 0, 1 should be changed to your desired GPU id
Start Ollama serve through ./start-ollama.sh
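For example, the full Linux flow in a single terminal session might look like the sketch below; the folder path and the GPU id 0 are placeholders to adjust for your setup:
cd PATH/TO/EXTRACTED/FOLDER
export ONEAPI_DEVICE_SELECTOR=level_zero:0   # GPU id as reported in the Ollama serve logs
./start-ollama.sh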
set OLLAMA_NUM_GPU=999
Thanks for your input here. However, I am really struggling with this and cannot get it working properly.
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
export SYCL_CACHE_PERSISTENT=1
export OLLAMA_KEEP_ALIVE=10m
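For completeness, this is roughly the full start sequence, with the variables exported in the same shell that launches start-ollama.sh (the folder path is taken from the log above):
cd /root/ollama-ipex-llm-2.2.0b20250318-ubuntu
export OLLAMA_NUM_GPU=999
export no_proxy=localhost,127.0.0.1
export ZES_ENABLE_SYSMAN=1
export SYCL_CACHE_PERSISTENT=1
export OLLAMA_KEEP_ALIVE=10m
./start-ollama.sh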
Even small models are loaded onto the CPU?
The strange thing is that ZES_ENABLE_SYSMAN=1 is exported as well. However, I still get the following warning
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
and again when I load the model:
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
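One way to double-check whether ZES_ENABLE_SYSMAN actually reaches the runner process is to inspect its environment via procfs (a sketch, assuming Linux and that pgrep matches the ollama-bin process from the log above):
tr '\0' '\n' < /proc/$(pgrep -f ollama-bin | head -n 1)/environ | grep ZES_ENABLE_SYSMAN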
Test prompt: "Tell me a joke?"
total duration: 11.529061668s
load duration: 36.011823ms
prompt eval count: 5 token(s)
prompt eval duration: 2.301s
prompt eval rate: 2.17 tokens/s
eval count: 36 token(s)
eval duration: 9.191s
eval rate: 3.92 tokens/s
ggml_sycl_init: found 1 SYCL devices:
NAME ID SIZE PROCESSOR UNTIL
gemma3-ggfu:latest 2a9bc20f0079 7.3 GB 100% CPU 8 minutes from now
Is there another log I could provide to check for other parameters that might be going wrong while loading the model? I am really confused.
Hello @chbacher! You can install and use the intel_gpu_top command to check if the GPU is being used for inference.

If you still do not see the GPU being used, please try setting ONEAPI_DEVICE_SELECTOR as mentioned by @suomi2024. For example:
export ONEAPI_DEVICE_SELECTOR=level_zero:0
Yeah, it's a cosmetic bug. It shows 100% CPU for me as well, but with intel_gpu_top you can verify that it runs on the GPU.
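For reference, a minimal sketch of checking GPU utilization this way, assuming an apt-based distribution where intel_gpu_top ships in the intel-gpu-tools package:
sudo apt install intel-gpu-tools    # provides intel_gpu_top
sudo intel_gpu_top                  # watch the engine busy percentages while a prompt is being generated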