intel / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, DeepSeek, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, DeepSpeed, Axolotl, etc.

Repository from GitHub: https://github.com/intel/ipex-llm

Running `intelanalytics/ipex-llm-inference-cpp-xpu` image with A770 GPU and AMD EPYC CPU

abjugard opened this issue

Describe the bug
I'm getting a bus error (core dumped) when running the benchmark in the intelanalytics/ipex-llm-inference-cpp-xpu:latest image on my A770 GPU and AMD EPYC CPU.

How to reproduce
Steps to reproduce the error:

  1. Start the image using the following shell script:
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest

docker run -it --rm \
           --net=host \
           --device=/dev/dri \
           -v /mnt/speedtank/llm/models:/models \
           -e no_proxy=localhost,127.0.0.1 \
           --memory="32G" \
           -e bench_model="mistral-7b-v0.1.Q4_K_M.gguf" \
           -e DEVICE=Arc \
           --shm-size="16g" \
           $DOCKER_IMAGE /bin/bash
  2. Run bash /llm/scripts/benchmark_llama-cpp.sh in the image
  3. Crash

Screenshots

root@brownie:/llm# bash /llm/scripts/benchmark_llama-cpp.sh
found oneapi in /opt/intel/oneapi/setvars.sh

:: initializing oneAPI environment ...
   benchmark_llama-cpp.sh: BASH_VERSION = 5.1.16(1)-release
   args: Using "$@" for setvars.sh arguments: --force
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: pti -- latest
:: tbb -- latest
:: umf -- latest
:: vtune -- latest
:: oneAPI environment initialized ::

/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/usr/local/lib/python3.11/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
+++++ Env Variables +++++
Internal:
    ENABLE_IOMP     = 1
    ENABLE_GPU      = 1
    ENABLE_JEMALLOC = 0
    ENABLE_TCMALLOC = 0
    LIB_DIR    = /usr/local/lib
    BIN_DIR    = bin64
    LLM_DIR    = /usr/local/lib/python3.11/dist-packages/ipex_llm

Exported:
    LD_PRELOAD             =
    OMP_NUM_THREADS        =
    MALLOC_CONF            =
    USE_XETLA              = OFF
    ENABLE_SDP_FUSION      =
    SYCL_CACHE_PERSISTENT  = 1
    BIGDL_LLM_XMX_DISABLED =
    SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS = 1
+++++++++++++++++++++++++
Complete.
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
build: 1 (aef9006) with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /models/mistral-7b-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: f_attn_scale     = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name     = mistralai_mistral-7b-v0.1
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:          CPU model buffer size =    70.31 MiB
llm_load_tensors:        SYCL0 model buffer size =  4095.05 MiB
./llm/scripts/benchmark_llama-cpp.sh: line 21:   540 Bus error               (core dumped) ./llama-cli -m $model -n 128 --prompt "${promt_1024_128}" -t 8 -e -ngl 999 --color --ctx-size 1024 --no-mmap --temp 0

If I run llama-cli with the -v flag I get the following output:

root@brownie:/llm/llama-cpp# ./llama-cli -m $model -n 128 --prompt "${promt_1024_128}"  -t 8 -e -ngl 999 --color --ctx-size 1024 --no-mmap --temp 0 -v
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
build: 1 (aef9006) with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /models/mistral-7b-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: control token:      2 '</s>' is not marked as EOG
llm_load_vocab: control token:      1 '<s>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: f_attn_scale     = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name     = mistralai_mistral-7b-v0.1
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
load_tensors: layer   0 assigned to device SYCL0, is_swa = 0
load_tensors: layer   1 assigned to device SYCL0, is_swa = 0
load_tensors: layer   2 assigned to device SYCL0, is_swa = 0
load_tensors: layer   3 assigned to device SYCL0, is_swa = 0
load_tensors: layer   4 assigned to device SYCL0, is_swa = 0
load_tensors: layer   5 assigned to device SYCL0, is_swa = 0
load_tensors: layer   6 assigned to device SYCL0, is_swa = 0
load_tensors: layer   7 assigned to device SYCL0, is_swa = 0
load_tensors: layer   8 assigned to device SYCL0, is_swa = 0
load_tensors: layer   9 assigned to device SYCL0, is_swa = 0
load_tensors: layer  10 assigned to device SYCL0, is_swa = 0
load_tensors: layer  11 assigned to device SYCL0, is_swa = 0
load_tensors: layer  12 assigned to device SYCL0, is_swa = 0
load_tensors: layer  13 assigned to device SYCL0, is_swa = 0
load_tensors: layer  14 assigned to device SYCL0, is_swa = 0
load_tensors: layer  15 assigned to device SYCL0, is_swa = 0
load_tensors: layer  16 assigned to device SYCL0, is_swa = 0
load_tensors: layer  17 assigned to device SYCL0, is_swa = 0
load_tensors: layer  18 assigned to device SYCL0, is_swa = 0
load_tensors: layer  19 assigned to device SYCL0, is_swa = 0
load_tensors: layer  20 assigned to device SYCL0, is_swa = 0
load_tensors: layer  21 assigned to device SYCL0, is_swa = 0
load_tensors: layer  22 assigned to device SYCL0, is_swa = 0
load_tensors: layer  23 assigned to device SYCL0, is_swa = 0
load_tensors: layer  24 assigned to device SYCL0, is_swa = 0
load_tensors: layer  25 assigned to device SYCL0, is_swa = 0
load_tensors: layer  26 assigned to device SYCL0, is_swa = 0
load_tensors: layer  27 assigned to device SYCL0, is_swa = 0
load_tensors: layer  28 assigned to device SYCL0, is_swa = 0
load_tensors: layer  29 assigned to device SYCL0, is_swa = 0
load_tensors: layer  30 assigned to device SYCL0, is_swa = 0
load_tensors: layer  31 assigned to device SYCL0, is_swa = 0
load_tensors: layer  32 assigned to device SYCL0, is_swa = 0
llm_load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:          CPU model buffer size =    70.31 MiB
llm_load_tensors:        SYCL0 model buffer size =  4095.05 MiB
load_all_data: no device found for buffer type CPU for async uploads
.Bus error (core dumped)

Any ideas?

Could you share the output of lscpu? I want to check the CPU flags.

Sure, here is the lscpu output from inside the ipex-llm Docker image:

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          45 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   10
  On-line CPU(s) list:    0-9
Vendor ID:                AuthenticAMD
  Model name:             AMD EPYC 7443P 24-Core Processor
    CPU family:           25
    Model:                1
    Thread(s) per core:   1
    Core(s) per socket:   1
    Socket(s):            10
    Stepping:             1
    BogoMIPS:             5699.99
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid e
                          xtd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd
                          ibrs ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves user_shstk clzero wbnoinvd arat umip pku ospke vaes vpclmulqdq rdpid overflow_reco
                          v succor fsrm
Virtualization features:
  Hypervisor vendor:      VMware
  Virtualization type:    full
Caches (sum of all):
  L1d:                    320 KiB (10 instances)
  L1i:                    320 KiB (10 instances)
  L2:                     5 MiB (10 instances)
  L3:                     320 MiB (10 instances)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-9
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Vulnerable: Safe RET, no microcode
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                  Not affected
  Tsx async abort:        Not affected

I guess the failure is caused by the lack of avx_vnni. We have removed this requirement in the latest Docker image; you can update your image and try again.
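
For reference, a quick way to check whether the host CPU advertises avx_vnni (it is not listed in the flags above) is a minimal probe like the one below, assuming a standard Linux /proc/cpuinfo layout:

# Print the avx_vnni flag if the CPU reports it; no output before the message means the flag is absent.
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -x 'avx_vnni' \
  || echo "avx_vnni not reported by this CPU"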

Still getting:

root@brownie:/llm/llama-cpp# ./llama-cli -m $model -n 128 --prompt "${promt_1024_128}"  -t 8 -e -ngl 999 --color --ctx-size 1024 --no-mmap --temp 0 -v
build: 1 (3240917) with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /models/mistral-7b-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V2
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.07 GiB (4.83 BPW)
init_tokenizer: initializing tokenizer for type 1
load: control token:      2 '</s>' is not marked as EOG
load: control token:      1 '<s>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1637 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 7B
print_info: model params     = 7.24 B
print_info: general.name     = mistralai_mistral-7b-v0.1
print_info: vocab type       = SPM
print_info: n_vocab          = 32000
print_info: n_merges         = 0
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: LF token         = 13 '<0x0A>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer   0 assigned to device SYCL0, is_swa = 0
load_tensors: layer   1 assigned to device SYCL0, is_swa = 0
load_tensors: layer   2 assigned to device SYCL0, is_swa = 0
load_tensors: layer   3 assigned to device SYCL0, is_swa = 0
load_tensors: layer   4 assigned to device SYCL0, is_swa = 0
load_tensors: layer   5 assigned to device SYCL0, is_swa = 0
load_tensors: layer   6 assigned to device SYCL0, is_swa = 0
load_tensors: layer   7 assigned to device SYCL0, is_swa = 0
load_tensors: layer   8 assigned to device SYCL0, is_swa = 0
load_tensors: layer   9 assigned to device SYCL0, is_swa = 0
load_tensors: layer  10 assigned to device SYCL0, is_swa = 0
load_tensors: layer  11 assigned to device SYCL0, is_swa = 0
load_tensors: layer  12 assigned to device SYCL0, is_swa = 0
load_tensors: layer  13 assigned to device SYCL0, is_swa = 0
load_tensors: layer  14 assigned to device SYCL0, is_swa = 0
load_tensors: layer  15 assigned to device SYCL0, is_swa = 0
load_tensors: layer  16 assigned to device SYCL0, is_swa = 0
load_tensors: layer  17 assigned to device SYCL0, is_swa = 0
load_tensors: layer  18 assigned to device SYCL0, is_swa = 0
load_tensors: layer  19 assigned to device SYCL0, is_swa = 0
load_tensors: layer  20 assigned to device SYCL0, is_swa = 0
load_tensors: layer  21 assigned to device SYCL0, is_swa = 0
load_tensors: layer  22 assigned to device SYCL0, is_swa = 0
load_tensors: layer  23 assigned to device SYCL0, is_swa = 0
load_tensors: layer  24 assigned to device SYCL0, is_swa = 0
load_tensors: layer  25 assigned to device SYCL0, is_swa = 0
load_tensors: layer  26 assigned to device SYCL0, is_swa = 0
load_tensors: layer  27 assigned to device SYCL0, is_swa = 0
load_tensors: layer  28 assigned to device SYCL0, is_swa = 0
load_tensors: layer  29 assigned to device SYCL0, is_swa = 0
load_tensors: layer  30 assigned to device SYCL0, is_swa = 0
load_tensors: layer  31 assigned to device SYCL0, is_swa = 0
load_tensors: layer  32 assigned to device SYCL0, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:          CPU model buffer size =    70.31 MiB
load_tensors:        SYCL0 model buffer size =  4095.05 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: disbale async uploads for ipex-llm sycl backend
.Bus error (core dumped)

Running image sha256:d46fa7e9fb68d568c5427a1dffe1c257121b4efecf9a87f88057e23f9e6e7846

Any other ideas?

Also on an EPYC CPU, same issue. Any progress?

#10955
I think... maybe Resizable BAR was disabled.
My motherboard needs a modded BIOS; I will try it.
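
A rough way to check Resizable BAR from inside Linux before touching the BIOS, assuming lspci from pciutils is installed and the grep pattern matches how the A770 is named on your system: with Resizable BAR active, the card's prefetchable memory region is reported at roughly the full VRAM size (about 16G on an A770); with it disabled, it is only 256M.

# Find the PCI address of the Arc GPU (the device name may differ, e.g. "DG2"; adjust the pattern if needed).
GPU_BDF=$(lspci | grep -iE 'VGA.*(Arc|DG2)' | awk '{print $1}')
# Show the BAR sizes and, where decoded, the Resizable BAR capability for that device.
sudo lspci -vv -s "$GPU_BDF" | grep -iE 'prefetchable|Resizable BAR'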

In my BIOS Resizable BAR was disabled as well. Will try again and report. #13029

Hmm, that might be it. Mine is a Supermicro H12SSL-CT running ESXi, and I seem to recall I couldn't figure out how to enable Resizable BAR; I don't remember whether the problem was the UEFI or ESXi.

I'll take another look at that and see if I can get it enabled, and whether that helps things.

I guess the failure is caused by the lack of avx_vnni. We have removed this requirement in the latest Docker image; you can update your image and try again.

The latest Docker image runs well for me, but I don't like it because it's running in VMware already. Can you release a new build? @qiuxin2012

In my case it was indeed the Resizable BAR BIOS setting (it should be Enabled or Auto). It now works perfectly.

I guess the failure is caused by the lack of avx_vnni. We have removed this requirement in the latest Docker image; you can update your image and try again.

The latest Docker image runs well for me, but I don't like it because it's running in VMware already. Can you release a new build? @qiuxin2012

You may use the following steps to build the Docker image:

git clone https://github.com/intel/ipex-llm.git        
cd ipex-llm/docker/llm/inference-cpp/
sudo docker build \
  --no-cache=true \
  -t intelanalytics/ipex-llm-inference-cpp-xpu:latest -f ./Dockerfile .
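
After the build, a quick sanity check (not part of the official instructions) is to confirm that docker run will pick up the locally built image rather than the previously pulled one; since the tag is the same, the original docker run script from the reproduction steps can then be reused unchanged.

sudo docker images intelanalytics/ipex-llm-inference-cpp-xpu:latest
# The CREATED column should show the time of the local build, not the upstream push.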

Can confirm that this was it for me too: I simply didn't have Resizable BAR enabled in firmware (well, the option didn't exist in the firmware version my board was running, so I had to upgrade that first and then muck about with advanced VM parameters). After doing so, everything is peachy!

Thanks so much @Ksdb104 for pointing me/us in that direction!
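
For anyone else trying this under ESXi: the "advanced VM parameters" mentioned above are, as far as I can tell, the usual 64-bit MMIO settings for GPU passthrough. Treat the parameter names, the size value, and the .vmx path below as assumptions to verify against VMware's documentation for your ESXi version; this is only a sketch of the idea.

# Hypothetical example: append the 64-bit MMIO passthrough settings to the VM's .vmx file.
# The datastore path is made up; the size should be a power of two larger than the GPU's total BAR size.
cat >> /vmfs/volumes/datastore1/gpu-vm/gpu-vm.vmx <<'EOF'
pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "64"
EOF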