Gemma 3 Context Shift Causes Gibberish Output (llama.cpp IPEX build)
Sketchfellow opened this issue
Describe the bug
The Intel IPEX-LLM llama.cpp portable build exhibits a bug with Gemma 3 models where the output becomes gibberish after a context shift. This is the same issue as ggml-org/llama.cpp#12357, which has been fixed in the main llama.cpp repository, but the fix does not appear to have been incorporated into the current IPEX-LLM build. The issue occurs when the model fills its context length and attempts to shift the context window: the generated text becomes nonsensical, degenerating into repeated characters and phrases.
How to reproduce
Steps to reproduce the error:
- Use the current IPEX-LLM llama.cpp build with Gemma 3 support
- Start the llama-server with the following command (adjust paths as necessary):
set SYCL_CACHE_PERSISTENT=1
set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=0
set ONEAPI_DEVICE_SELECTOR=level_zero:1
llama-server -m C:\LLM\google_gemma-3-12b-it-Q4_0.gguf -ngl 99 -c 512 -b 512 --temp 0 --seed 0 -n 1000
- I used the built-in web UI, but you can also use the API, as shown in the original issue (a minimal curl sketch is included after the sample output below). Send the following prompt:
Tell me a 1000 word math proof
- Observe the response. The output text is clear at first but becomes nonsensical and repetitive after some point:
Okay, let's construct a relatively involved proof, aiming for approximately 1000 words. We'll prove a theorem concerning the prime number theorem, specifically a slightly refined version:
Theorem: Let π(x) be the prime-counting function, which gives the number of primes less than or equal to x. The prime number theorem states that π(x) is asymptotic to x / ln(x) as x approaches infinity. We'll prove a slightly stronger result:
For x ≥ 17, |π(x) - x / ln(x)| < 2x / ln²(x).
In simpler terms, we're showing that the difference between the actual number of primes and the approximation x/ln(x) is bounded by 2x/ln²(x). This is a tighter bound than just stating that x/ln(x) is a good approximation, as it provides a measurable error bound.
Proof Outline:
Background - The Prime Number Theorem (PNT) and Chebyshev Functions: We'll introduce the Chebyshev functions θ(x) and ψ(x), which are related to the distribution of primes and are more amenable to analysis than π(x) directly. θ(x) is the sum of the natural logarithms of the primes less than or equal to x, and ψ(x) is the sum of the natural logarithms of all primes less than or equal to x, including with multiplicity (i.e., if p is a prime, it appears in the sum as many times as it appears in the prime factorization of numbers up to x). These functions are closely related to π(x).
Key Lemma – Chebyshev Function Bound: We'll prove a lemma stating that ψ(x) is bounded by x and -x/2, and that x - ψ(x) is not too large. This is a crucial step in establishing the error bound.
Connecting ψ(x) to π(x): We'll use known relationships between ψ(x) and π(x) to express π(x) in terms of ψ(x).
The Error Bound: Finally, we'll combine the bounds on ψ(x) with the expression for π(x) to obtain the ψ(and ψ(x - ψ(x - ψ(x andψ(ψ(and ψ( ψ( ψ(ψ(and to and ψ(ψ(ψ(ψ( ψ(ψ(ψ(ψ(ψ(ψ( ψ(ψ(ψ(ψ(ψ(ψ(ψ(ψ( ψ(ψ(ψ(ψ(ψ(x and ψ(ψ(ψ(ψ(ψ( ψ(ψ( ψ(ψ(ψ(ψ( ψ(ψ(ψ( and ψ( ψ and ψ(ψ(ψ ψ and ψ(ψ( ψ( ψ and the error( and ψ and, and of ψ and ψ(ψ ψ(ψ( and ψ π(ψ ψ ψ and of ψ( ψ ψ ψ and ψ ofψ and ψ and of and ψ and to derive the error,ψ and to and ψ and to get the error and of ψ and ψ and ψ and to and ψψ and and to ofψ of the errorψ and to obtain the desired error and of and and to and to derive and of and of to get and of to derive the desired error of the error bound.
Proof:
Chebyshev Functions:
θ and θ, θ and θ the actual value, the.
2, θ, value
2
2, and θ and
2,
2,
2
2, which is,
2,
2 and the,
2
2
2 and θ
2
2
2
2
2
2
2
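For reference, the same behaviour can be reproduced without the web UI by calling the server's OpenAI-compatible endpoint directly (the log below shows the request arriving as POST /v1/chat/completions). A minimal sketch, assuming the default 127.0.0.1:8080 address used above:

curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\": [{\"role\": \"user\", \"content\": \"Tell me a 1000 word math proof\"}], \"max_tokens\": 1000}"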
Server log output
ggml_sycl_init: GGML_SYCL_FORCE_MMQ: no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
build: 1 (4cfa0b8) with MSVC 19.38.33133.0 for
system info: n_threads = 14, n_threads_batch = 14, total_threads = 20
system_info: n_threads = 14 (n_threads_batch = 14) / 20 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 19
main: loading model
srv load_model: loading model 'C:\LLM\google_gemma-3-12b-it-Q4_0.gguf'
llama_load_model_from_file: using device SYCL0 (Intel(R) Arc(TM) A770M Graphics) - 15473 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 626 tensors from C:\LLM\google_gemma-3-12b-it-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma 3 12b It
llama_model_loader: - kv 3: general.finetune str = it
llama_model_loader: - kv 4: general.basename str = gemma-3
llama_model_loader: - kv 5: general.size_label str = 12B
llama_model_loader: - kv 6: general.license str = gemma
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Gemma 3 12b Pt
llama_model_loader: - kv 9: general.base_model.0.organization str = Google
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv 11: general.tags arr[str,1] = ["image-text-to-text"]
llama_model_loader: - kv 12: gemma3.context_length u32 = 131072
llama_model_loader: - kv 13: gemma3.embedding_length u32 = 3840
llama_model_loader: - kv 14: gemma3.block_count u32 = 48
llama_model_loader: - kv 15: gemma3.feed_forward_length u32 = 15360
llama_model_loader: - kv 16: gemma3.attention.head_count u32 = 16
llama_model_loader: - kv 17: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 18: gemma3.attention.key_length u32 = 256
llama_model_loader: - kv 19: gemma3.attention.value_length u32 = 256
llama_model_loader: - kv 20: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 21: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 22: gemma3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 23: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 24: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 25: tokenizer.ggml.model str = llama
llama_model_loader: - kv 26: tokenizer.ggml.pre str = default
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,262144] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 28: tokenizer.ggml.scores arr[f32,262144] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,262144] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 32: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 35: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 36: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 37: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - kv 39: general.file_type u32 = 2
llama_model_loader: - kv 40: quantize.imatrix.file str = /models_out/gemma-3-12b-it-GGUF/googl...
llama_model_loader: - kv 41: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 42: quantize.imatrix.entries_count i32 = 336
llama_model_loader: - kv 43: quantize.imatrix.chunks_count i32 = 129
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q4_0: 330 tensors
llama_model_loader: - type q4_1: 6 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 6414
llm_load_vocab: token to piece cache size = 1.9446 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = gemma3
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 262144
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3840
llm_load_print_meta: n_layer = 48
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 256
llm_load_print_meta: n_swa = 1024
llm_load_print_meta: n_embd_head_k = 256
llm_load_print_meta: n_embd_head_v = 256
llm_load_print_meta: n_gqa = 2
llm_load_print_meta: n_embd_k_gqa = 2048
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: f_attn_scale = 6.2e-02
llm_load_print_meta: n_ff = 15360
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 0.125
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 12B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 11.77 B
llm_load_print_meta: model size = 6.43 GiB (4.69 BPW)
llm_load_print_meta: general.name = Gemma 3 12b It
llm_load_print_meta: BOS token = 2 '<bos>'
llm_load_print_meta: EOS token = 1 '<eos>'
llm_load_print_meta: EOT token = 106 '<end_of_turn>'
llm_load_print_meta: UNK token = 3 '<unk>'
llm_load_print_meta: PAD token = 0 '<pad>'
llm_load_print_meta: LF token = 248 '<0x0A>'
llm_load_print_meta: EOG token = 1 '<eos>'
llm_load_print_meta: EOG token = 106 '<end_of_turn>'
llm_load_print_meta: max token length = 48
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors: SYCL0 model buffer size = 6582.82 MiB
llm_load_tensors: CPU_Mapped model buffer size = 787.50 MiB
.................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_ctx_per_seq = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 0.125
llama_new_context_with_model: n_ctx_per_seq (512) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|               Intel Arc A770M Graphics|  12.55|    512|    1024|   32| 16704M|            1.6.32413|
llama_kv_cache_init: SYCL0 KV buffer size = 192.00 MiB
llama_new_context_with_model: KV self size = 192.00 MiB, K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 1.00 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 519.50 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 9.51 MiB
llama_new_context_with_model: graph nodes = 1975 (with bs=512), 1831 (with bs=1)
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 512
main: model loaded
main: chat template, built_in: 1, chat_example: '<start_of_turn>user
You are a helpful assistant
Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
request: GET /v1/chat/completions 127.0.0.1 404
request: GET / 127.0.0.1 200
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 512, n_keep = 0, n_prompt_tokens = 20
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 20, n_tokens = 20, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 20, n_tokens = 20
slot update_slots: id 0 | task 0 | slot context shift, n_keep = 1, n_left = 510, n_discard = 255
slot update_slots: id 0 | task 0 | slot context shift, n_keep = 1, n_left = 510, n_discard = 255
srv cancel_tasks: cancel task, id_task = 0
request: POST /v1/chat/completions 127.0.0.1 200
slot release: id 0 | task 0 | stop processing: n_past = 328, truncated = 1
srv update_slots: all slots are idle
Hi Sketchfellow,
Thank you for reporting this issue. We're working on a fix and will provide updates once it's ready. In the meantime, you can try setting a larger -c as a workaround.
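For illustration, a hedged example of that workaround using the same command as above with a larger context window (8192 is only an example value, not a recommendation):

llama-server -m C:\LLM\google_gemma-3-12b-it-Q4_0.gguf -ngl 99 -c 8192 -b 512 --temp 0 --seed 0 -n 1000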
Hi Sketchfellow,
The fix is now available in bigdl-core-cpp 2.6.0b20250317. You can follow the instructions in this guide to download the latest ipex-llm[cpp] support and give it a try.
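For reference, a sketch of the corresponding upgrade command (assuming a pip-based ipex-llm[cpp] setup; see the linked guide for the exact steps on your platform):

pip install --pre --upgrade ipex-llm[cpp]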
Thank you! The model now runs coherently after a context shift.