Running `intelanalytics/ipex-llm-inference-cpp-xpu` image with A770 GPU and AMD EPYC CPU
abjugard opened this issue
Describe the bug
I'm getting a bus error (core dumped) when running the benchmark in the intelanalytics/ipex-llm-inference-cpp-xpu:latest image on my A770 GPU and AMD EPYC CPU.
How to reproduce
Steps to reproduce the error:
- Start the image using the following shell script:
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-inference-cpp-xpu:latest
docker run -it --rm \
--net=host \
--device=/dev/dri \
-v /mnt/speedtank/llm/models:/models \
-e no_proxy=localhost,127.0.0.1 \
--memory="32G" \
-e bench_model="mistral-7b-v0.1.Q4_K_M.gguf" \
-e DEVICE=Arc \
--shm-size="16g" \
$DOCKER_IMAGE /bin/bash
- Run bash /llm/scripts/benchmark_llama-cpp.sh in the image
- Crash
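(A quick sanity check, not part of the original report: listing the SYCL devices from inside the container with the oneAPI sycl-ls tool that ships in the image confirms the Arc GPU is visible before running the benchmark.)
source /opt/intel/oneapi/setvars.sh --force
sycl-ls
# expect an entry similar to: [level_zero:gpu] Intel(R) Arc(TM) A770 Graphics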
Screenshots
root@brownie:/llm# bash /llm/scripts/benchmark_llama-cpp.sh
found oneapi in /opt/intel/oneapi/setvars.sh
:: initializing oneAPI environment ...
benchmark_llama-cpp.sh: BASH_VERSION = 5.1.16(1)-release
args: Using "$@" for setvars.sh arguments: --force
:: advisor -- latest
:: ccl -- latest
:: compiler -- latest
:: dal -- latest
:: debugger -- latest
:: dev-utilities -- latest
:: dnnl -- latest
:: dpcpp-ct -- latest
:: dpl -- latest
:: ipp -- latest
:: ippcp -- latest
:: mkl -- latest
:: mpi -- latest
:: pti -- latest
:: tbb -- latest
:: umf -- latest
:: vtune -- latest
:: oneAPI environment initialized ::
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/usr/local/lib/python3.11/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
/usr/local/lib/python3.11/dist-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_torch_pytree._register_pytree_node(
+++++ Env Variables +++++
Internal:
ENABLE_IOMP = 1
ENABLE_GPU = 1
ENABLE_JEMALLOC = 0
ENABLE_TCMALLOC = 0
LIB_DIR = /usr/local/lib
BIN_DIR = bin64
LLM_DIR = /usr/local/lib/python3.11/dist-packages/ipex_llm
Exported:
LD_PRELOAD =
OMP_NUM_THREADS =
MALLOC_CONF =
USE_XETLA = OFF
ENABLE_SDP_FUSION =
SYCL_CACHE_PERSISTENT = 1
BIGDL_LLM_XMX_DISABLED =
SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS = 1
+++++++++++++++++++++++++
Complete.
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
build: 1 (aef9006) with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /models/mistral-7b-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-v0.1
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: f_attn_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-v0.1
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU model buffer size = 70.31 MiB
llm_load_tensors: SYCL0 model buffer size = 4095.05 MiB
./llm/scripts/benchmark_llama-cpp.sh: line 21: 540 Bus error (core dumped) ./llama-cli -m $model -n 128 --prompt "${promt_1024_128}" -t 8 -e -ngl 999 --color --ctx-size 1024 --no-mmap --temp 0
If I run llama-cli with the -v flag I get the following output:
root@brownie:/llm/llama-cpp# ./llama-cli -m $model -n 128 --prompt "${promt_1024_128}" -t 8 -e -ngl 999 --color --ctx-size 1024 --no-mmap --temp 0 -v
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
build: 1 (aef9006) with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_load_model_from_file: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /models/mistral-7b-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-v0.1
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: control token: 2 '</s>' is not marked as EOG
llm_load_vocab: control token: 1 '<s>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1637 MB
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: f_attn_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 4.07 GiB (4.83 BPW)
llm_load_print_meta: general.name = mistralai_mistral-7b-v0.1
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
load_tensors: layer 0 assigned to device SYCL0, is_swa = 0
load_tensors: layer 1 assigned to device SYCL0, is_swa = 0
load_tensors: layer 2 assigned to device SYCL0, is_swa = 0
load_tensors: layer 3 assigned to device SYCL0, is_swa = 0
load_tensors: layer 4 assigned to device SYCL0, is_swa = 0
load_tensors: layer 5 assigned to device SYCL0, is_swa = 0
load_tensors: layer 6 assigned to device SYCL0, is_swa = 0
load_tensors: layer 7 assigned to device SYCL0, is_swa = 0
load_tensors: layer 8 assigned to device SYCL0, is_swa = 0
load_tensors: layer 9 assigned to device SYCL0, is_swa = 0
load_tensors: layer 10 assigned to device SYCL0, is_swa = 0
load_tensors: layer 11 assigned to device SYCL0, is_swa = 0
load_tensors: layer 12 assigned to device SYCL0, is_swa = 0
load_tensors: layer 13 assigned to device SYCL0, is_swa = 0
load_tensors: layer 14 assigned to device SYCL0, is_swa = 0
load_tensors: layer 15 assigned to device SYCL0, is_swa = 0
load_tensors: layer 16 assigned to device SYCL0, is_swa = 0
load_tensors: layer 17 assigned to device SYCL0, is_swa = 0
load_tensors: layer 18 assigned to device SYCL0, is_swa = 0
load_tensors: layer 19 assigned to device SYCL0, is_swa = 0
load_tensors: layer 20 assigned to device SYCL0, is_swa = 0
load_tensors: layer 21 assigned to device SYCL0, is_swa = 0
load_tensors: layer 22 assigned to device SYCL0, is_swa = 0
load_tensors: layer 23 assigned to device SYCL0, is_swa = 0
load_tensors: layer 24 assigned to device SYCL0, is_swa = 0
load_tensors: layer 25 assigned to device SYCL0, is_swa = 0
load_tensors: layer 26 assigned to device SYCL0, is_swa = 0
load_tensors: layer 27 assigned to device SYCL0, is_swa = 0
load_tensors: layer 28 assigned to device SYCL0, is_swa = 0
load_tensors: layer 29 assigned to device SYCL0, is_swa = 0
load_tensors: layer 30 assigned to device SYCL0, is_swa = 0
load_tensors: layer 31 assigned to device SYCL0, is_swa = 0
load_tensors: layer 32 assigned to device SYCL0, is_swa = 0
llm_load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU model buffer size = 70.31 MiB
llm_load_tensors: SYCL0 model buffer size = 4095.05 MiB
load_all_data: no device found for buffer type CPU for async uploads
.Bus error (core dumped)
Any ideas?
Could you share the output of lscpu? I want to check the CPU flags.
Sure, here is the output of lscpu run from inside the ipex-llm Docker image:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 45 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 10
On-line CPU(s) list: 0-9
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7443P 24-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 10
Stepping: 1
BogoMIPS: 5699.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibrs ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves user_shstk clzero wbnoinvd arat umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor fsrm
Virtualization features:
Hypervisor vendor: VMware
Virtualization type: full
Caches (sum of all):
L1d: 320 KiB (10 instances)
L1i: 320 KiB (10 instances)
L2: 5 MiB (10 instances)
L3: 320 MiB (10 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-9
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Vulnerable: Safe RET, no microcode
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Srbds: Not affected
Tsx async abort: Not affected
I guess the failure is caused by the lack of avx_vnni; we have removed this requirement in the latest Docker image, so you can update your image and try again.
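(For reference, a minimal check of whether the host CPU advertises that flag; the lscpu output above from the EPYC 7443P does not list avx_vnni.)
if grep -qm1 avx_vnni /proc/cpuinfo; then echo "avx_vnni present"; else echo "avx_vnni missing"; fi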
Still getting:
root@brownie:/llm/llama-cpp# ./llama-cli -m $model -n 128 --prompt "${promt_1024_128}" -t 8 -e -ngl 999 --color --ctx-size 1024 --no-mmap --temp 0 -v
build: 1 (3240917) with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) - 15473 MiB free
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /models/mistral-7b-v0.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-v0.1
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
print_info: file format = GGUF V2
print_info: file type = Q4_K - Medium
print_info: file size = 4.07 GiB (4.83 BPW)
init_tokenizer: initializing tokenizer for type 1
load: control token: 2 '</s>' is not marked as EOG
load: control token: 1 '<s>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1637 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 14336
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 7B
print_info: model params = 7.24 B
print_info: general.name = mistralai_mistral-7b-v0.1
print_info: vocab type = SPM
print_info: n_vocab = 32000
print_info: n_merges = 0
print_info: BOS token = 1 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 0 '<unk>'
print_info: LF token = 13 '<0x0A>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer 0 assigned to device SYCL0, is_swa = 0
load_tensors: layer 1 assigned to device SYCL0, is_swa = 0
load_tensors: layer 2 assigned to device SYCL0, is_swa = 0
load_tensors: layer 3 assigned to device SYCL0, is_swa = 0
load_tensors: layer 4 assigned to device SYCL0, is_swa = 0
load_tensors: layer 5 assigned to device SYCL0, is_swa = 0
load_tensors: layer 6 assigned to device SYCL0, is_swa = 0
load_tensors: layer 7 assigned to device SYCL0, is_swa = 0
load_tensors: layer 8 assigned to device SYCL0, is_swa = 0
load_tensors: layer 9 assigned to device SYCL0, is_swa = 0
load_tensors: layer 10 assigned to device SYCL0, is_swa = 0
load_tensors: layer 11 assigned to device SYCL0, is_swa = 0
load_tensors: layer 12 assigned to device SYCL0, is_swa = 0
load_tensors: layer 13 assigned to device SYCL0, is_swa = 0
load_tensors: layer 14 assigned to device SYCL0, is_swa = 0
load_tensors: layer 15 assigned to device SYCL0, is_swa = 0
load_tensors: layer 16 assigned to device SYCL0, is_swa = 0
load_tensors: layer 17 assigned to device SYCL0, is_swa = 0
load_tensors: layer 18 assigned to device SYCL0, is_swa = 0
load_tensors: layer 19 assigned to device SYCL0, is_swa = 0
load_tensors: layer 20 assigned to device SYCL0, is_swa = 0
load_tensors: layer 21 assigned to device SYCL0, is_swa = 0
load_tensors: layer 22 assigned to device SYCL0, is_swa = 0
load_tensors: layer 23 assigned to device SYCL0, is_swa = 0
load_tensors: layer 24 assigned to device SYCL0, is_swa = 0
load_tensors: layer 25 assigned to device SYCL0, is_swa = 0
load_tensors: layer 26 assigned to device SYCL0, is_swa = 0
load_tensors: layer 27 assigned to device SYCL0, is_swa = 0
load_tensors: layer 28 assigned to device SYCL0, is_swa = 0
load_tensors: layer 29 assigned to device SYCL0, is_swa = 0
load_tensors: layer 30 assigned to device SYCL0, is_swa = 0
load_tensors: layer 31 assigned to device SYCL0, is_swa = 0
load_tensors: layer 32 assigned to device SYCL0, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: CPU model buffer size = 70.31 MiB
load_tensors: SYCL0 model buffer size = 4095.05 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: disbale async uploads for ipex-llm sycl backend
.Bus error (core dumped)
Running image sha256:d46fa7e9fb68d568c5427a1dffe1c257121b4efecf9a87f88057e23f9e6e7846
Any other ideas?
Also on an EPYC CPU, same issue. Any progress?
#10955
I think... maybe Resizable BAR was disabled.
My motherboard needs a modded BIOS; I will try it.
In my BIOS Resizable BAR was disabled as well. Will try again and report. #13029
Hmm, that might be it. Mine is a Supermicro H12SSL-CT running ESXi, and I seem to recall I couldn't figure out how to enable Resizable BAR; I don't remember if it was UEFI or ESXi that was the problem.
Will take another look at that and see if I can get it enabled and if that helps things.
I guess the failure is caused by the lack of avx_vnni; we have removed this requirement in the latest Docker image, so you can update your image and try again.
The latest Docker image runs well for me, but I don't like it because it's running in VMware already. Can you release a new build? @qiuxin2012
In my case it was indeed the BIOS setting for Resizable BAR (it should be Enabled or Auto). It now works perfectly.
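(For anyone wanting to verify the setting took effect, a sketch of a check from the Linux host, assuming lspci is available and the Arc card is the only Intel VGA device: with Resizable BAR active, the card's large memory region is typically reported near the full 16G of VRAM rather than 256M.)
GPU=$(lspci -D | grep -i 'VGA.*Intel' | awk '{print $1}')
sudo lspci -vv -s "$GPU" | grep -E 'Region 2|Resizable BAR'
# Region 2 around [size=16G] suggests Resizable BAR is in effect; [size=256M] suggests it is not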
I guess the failure is caused by the lack of avx_vnni; we have removed this requirement in the latest Docker image, so you can update your image and try again.
The latest Docker image runs well for me, but I don't like it because it's running in VMware already. Can you release a new build? @qiuxin2012
You may use the following steps to build the Docker image:
git clone https://github.com/intel/ipex-llm.git
cd ipex-llm/docker/llm/inference-cpp/
sudo docker build \
--no-cache=true \
-t intelanalytics/ipex-llm-inference-cpp-xpu:latest -f ./Dockerfile .
Can confirm that this was it for me too: I simply didn't have Resizable BAR enabled in firmware (well, the option didn't exist in the firmware version my board was running, so I had to upgrade that first and then muck about with advanced VM parameters). After doing so, everything is peachy!
Thanks so much @Ksdb104 for pointing me/us in that direction!
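(For anyone else passing the GPU through under ESXi, the "advanced VM parameters" mentioned above are usually the 64-bit MMIO settings needed for a device with a large BAR; a sketch of the typical .vmx entries, with the size adapted to the GPU's VRAM:)
pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "64"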