intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

IPEX-LLM (llama.cpp) hits a core dump when running Qwen-7B-Q4_K_M.gguf on Intel Arc A770

jianweimama opened this issue · comments

The IPEX-LLM llama.cpp steps I followed are as follows:
1. Install oneAPI
#wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null

#echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list

#sudo apt update

#sudo apt install intel-oneapi-common-vars=2024.0.0-49406 \
    intel-oneapi-common-oneapi-vars=2024.0.0-49406 \
    intel-oneapi-diagnostics-utility=2024.0.0-49093 \
    intel-oneapi-compiler-dpcpp-cpp=2024.0.2-49895 \
    intel-oneapi-dpcpp-ct=2024.0.0-49381 \
    intel-oneapi-mkl=2024.0.0-49656 \
    intel-oneapi-mkl-devel=2024.0.0-49656 \
    intel-oneapi-mpi=2021.11.0-49493 \
    intel-oneapi-mpi-devel=2021.11.0-49493 \
    intel-oneapi-dal=2024.0.1-25 \
    intel-oneapi-dal-devel=2024.0.1-25 \
    intel-oneapi-ippcp=2021.9.1-5 \
    intel-oneapi-ippcp-devel=2021.9.1-5 \
    intel-oneapi-ipp=2021.10.1-13 \
    intel-oneapi-ipp-devel=2021.10.1-13 \
    intel-oneapi-tlt=2024.0.0-352 \
    intel-oneapi-ccl=2021.11.2-5 \
    intel-oneapi-ccl-devel=2021.11.2-5 \
    intel-oneapi-dnnl-devel=2024.0.0-49521 \
    intel-oneapi-dnnl=2024.0.0-49521 \
    intel-oneapi-tcm-1.0=1.0.0-435
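As a quick sanity check after the install (a sketch, assuming the default /opt/intel/oneapi install location):

source /opt/intel/oneapi/setvars.sh   # load the freshly installed oneAPI environment
icpx --version                        # should report the 2024.0 DPC++/C++ compiler
sycl-ls                               # should list the Arc A770 as a level_zero:gpu device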

2. Setup Python Environment
#wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
#bash Miniforge3-Linux-x86_64.sh

conda create -n llm python=3.11

conda activate llm

3. Install IPEX-LLM for llama.cpp
#pip install --pre --upgrade ipex-llm[cpp]
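Optionally, verify the install before continuing (a quick check; the package name ipex-llm is the one installed above):

pip show ipex-llm        # confirm the installed ipex-llm version
which init-llama-cpp     # the helper script used in the next step should now be on PATH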

4. Setup for running llama.cpp
#mkdir llama-cpp
#cd llama-cpp
(llm) llama-cpp# init-llama-cpp
(llm) llama-cpp# ls
baby-llama beam-search convert-llama2c-to-ggml export-lora gguf-py infill lookahead main perplexity quantize-stats simple train-text-from-scratch
batched benchmark convert.py finetune gritlm llama-bench lookup parallel q8dot save-load-state speculative vdot
batched-bench convert-hf-to-gguf.py embedding gguf imatrix llava-cli ls-sycl-device passkey quantize server tokenize

5. Runtime Configuration
#source /opt/intel/oneapi/setvars.sh
#export SYCL_CACHE_PERSISTENT=1
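Optionally, the warning later in the log ("ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support)") suggests one more variable, and the bundled ls-sycl-device tool can confirm the GPU is visible before running; a sketch:

export ZES_ENABLE_SYSMAN=1   # lets SYCL report free GPU memory instead of falling back to total memory
./ls-sycl-device             # installed by init-llama-cpp; should list the Arc A770 before running ./main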

6. Run the quantized model
(llm) llama-cpp# ./main -m Qwen-7B-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color
Log start
main: build = 1 (9140e0f)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.0 (2024.0.0.20231017) for x86_64-unknown-linux-gnu
main: seed = 1717580760
llama_model_loader: loaded meta data with 20 key-value pairs and 259 tensors from Qwen-7B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen
llama_model_loader: - kv 1: general.name str = Qwen
llama_model_loader: - kv 2: qwen.context_length u32 = 8192
llama_model_loader: - kv 3: qwen.block_count u32 = 32
llama_model_loader: - kv 4: qwen.embedding_length u32 = 4096
llama_model_loader: - kv 5: qwen.feed_forward_length u32 = 22016
llama_model_loader: - kv 6: qwen.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 7: qwen.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: qwen.attention.head_count u32 = 32
llama_model_loader: - kv 9: qwen.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 151643
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 97 tensors
llama_model_loader: - type q4_K: 113 tensors
llama_model_loader: - type q5_K: 32 tensors
llama_model_loader: - type q6_K: 17 tensors
llm_load_vocab: special tokens definition check successful ( 293/151936 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 22016
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 7.72 B
llm_load_print_meta: model size = 4.56 GiB (5.07 BPW)
llm_load_print_meta: general.name = Qwen
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: UNK token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 3 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.28717|
| 1|     [opencl:cpu:0]|                  Intel Xeon Gold 6438N|    3.0|    128|    8192|   64|270034M| 2023.16.12.0.12_195853.xmain-hotfix|
| 2|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|    128|67108864|   64|270034M| 2023.16.12.0.12_195853.xmain-hotfix|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:512
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 buffer size = 4332.75 MiB
llm_load_tensors: CPU buffer size = 333.84 MiB
....................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 0.58 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 304.75 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 9.01 MiB
llama_new_context_with_model: graph nodes = 1222
llama_new_context_with_model: graph splits = 2
oneapi::mkl::oneapi::mkl::blas::gemm: cannot allocate memory on host
Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/llama-cpp-bigdl/ggml-sycl.cpp, line:15299, func:operator()
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/llama-cpp-bigdl/ggml-sycl.cpp:15299
GGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/llama-cpp-bigdl/ggml-sycl.cpp:3021: !"SYCL error"
[New LWP 11323]
[New LWP 11324]
[New LWP 11325]
......
[New LWP 11449]
[New LWP 11450]
warning: File "/opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7.0.0-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
    add-auto-load-safe-path /opt/intel/oneapi/compiler/2024.0/lib/libsycl.so.7.0.0-gdb.py

line to your configuration file "/root/.config/gdb/gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/root/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007d34e78ea42f in __GI___wait4 (pid=11456, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007d34e78ea42f in __GI___wait4 (pid=11456, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x0000000000635d16 in ggml_sycl_mul_mat(ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#2 0x0000000000631737 in ggml_sycl_compute_forward(ggml_compute_params*, ggml_tensor*) ()
#3 0x00000000006f599f in ggml_backend_sycl_graph_compute(ggml_backend*, ggml_cgraph*) ()
#4 0x00000000005e5698 in ggml_backend_sched_graph_compute_async ()
#5 0x00000000004e7f0c in llama_decode ()
#6 0x000000000044cc0c in llama_init_from_gpt_params(gpt_params&) ()
#7 0x000000000043670e in main ()
[Inferior 1 (process 11321) detached]
Aborted (core dumped)

Re-posting the part above that was unclear:

Exception caught at file:/home/runner/_work/llm.cpp/llm.cpp/llama-cpp-bigdl/ggml-sycl.cpp, line:15299, func:operator()
SYCL error: CHECK_TRY_ERROR(dpct::gemm_batch( *g_sycl_handles[g_main_device], oneapi::mkl::transpose::trans, oneapi::mkl::transpose::nontrans, ne01, ne11, ne10, alpha, (const void **)(ptrs_src.get() + 0 * ne23), dpct::library_data_t::real_half, nb01 / nb00, (const void **)(ptrs_src.get() + 1 * ne23), dpct::library_data_t::real_half, nb11 / nb10, beta, (void **)(ptrs_dst.get() + 0 * ne23), cu_data_type, ne01, ne23, cu_compute_type)): Meet error in this line code!
in function ggml_sycl_mul_mat_batched_sycl at /home/runner/_work/llm.cpp/llm.cpp/llama-cpp-bigdl/ggml-sycl.cpp:15299
GGML_ASSERT: /home/runner/_work/llm.cpp/llm.cpp/llama-cpp-bigdl/ggml-sycl.cpp:3021: !"SYCL error"

Hi @jianweimama, I could not reproduce the error on our machine.
The model I used is https://huggingface.co/RichardErkhov/Qwen_-_Qwen-7B-gguf/tree/main?show_file_info=Qwen-7B.Q4_K_M.gguf, and I followed the same steps as in your comment (except the oneAPI installation part).

Please follow this guide to reinstall oneAPI: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id1 (the "Intel® oneAPI Base Toolkit 2024.0 installation methods" part), and try again.
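If it helps, a rough way to compare the currently pinned oneAPI components against that guide, and to re-check the SYCL devices after reinstalling (a sketch, assuming the apt-based install from your steps above):

apt list --installed 2>/dev/null | grep intel-oneapi   # currently installed oneAPI packages and versions
source /opt/intel/oneapi/setvars.sh
sycl-ls                                                # the Arc A770 should appear as a level_zero:gpu device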

Here is my log:

(yina-llm) arda@arda-arc18:~/yina/llama-cpp$ ./main -m /mnt/disk1/models/gguf_models/Qwen-7B.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color
Log start
main: build = 1 (874d454)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.0 (2024.0.0.20231017) for x86_64-unknown-linux-gnu
main: seed  = 1717776653
llama_model_loader: loaded meta data with 20 key-value pairs and 259 tensors from /mnt/disk1/models/gguf_models/Qwen-7B.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen
llama_model_loader: - kv   1:                               general.name str              = Qwen
llama_model_loader: - kv   2:                        qwen.context_length u32              = 32768
llama_model_loader: - kv   3:                           qwen.block_count u32              = 32
llama_model_loader: - kv   4:                      qwen.embedding_length u32              = 4096
llama_model_loader: - kv   5:                   qwen.feed_forward_length u32              = 22016
llama_model_loader: - kv   6:                        qwen.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv   7:                  qwen.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                  qwen.attention.head_count u32              = 32
llama_model_loader: - kv   9:      qwen.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 151643
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q4_K:  113 tensors
llama_model_loader: - type q5_K:   32 tensors
llama_model_loader: - type q6_K:   17 tensors
llm_load_vocab: special tokens definition check successful ( 293/151936 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 22016
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.72 B
llm_load_print_meta: model size       = 4.56 GiB (5.07 BPW) 
llm_load_print_meta: general.name     = Qwen
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: UNK token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.28202|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:512
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  4332.75 MiB
llm_load_tensors:        CPU buffer size =   333.84 MiB
....................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   304.75 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =     9.01 MiB
llama_new_context_with_model: graph nodes  = 1222
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 8 / 32 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = 32, n_keep = 0


Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun all the time. But there was one problem – she was very shy. So she decided to take a different approach to her adventures.
She would take a deep
llama_print_timings:        load time =    1966.84 ms
llama_print_timings:      sample time =       1.55 ms /    32 runs   (    0.05 ms per token, 20671.83 tokens per second)
llama_print_timings: prompt eval time =     181.95 ms /    31 tokens (    5.87 ms per token,   170.38 tokens per second)
llama_print_timings:        eval time =     669.68 ms /    31 runs   (   21.60 ms per token,    46.29 tokens per second)
llama_print_timings:       total time =     864.59 ms /    62 tokens
Log end

I also noticed a level_zero driver version mismatch. Please try reinstalling the GPU driver.
Recommended driver version: I915_24.1.11_PSB_240117.14
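For reference, a rough way to check the installed GPU driver stack before and after updating (the package names here are assumptions for Intel's out-of-tree i915 / Level Zero packages; adjust to your distro and repository):

dpkg -l | grep -E 'intel-i915-dkms|intel-level-zero-gpu|level-zero'   # installed i915 DKMS and Level Zero driver packages
./ls-sycl-device                                                      # the Driver version column should reflect the new driver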

Updated the GPU driver to the recommended I915_24.1.11_PSB_240117.14, and it works now.
Thanks for your prompt response.