intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

phi3 medium - garbage output in webui or generated by ollama

js333031 opened this issue · comments

In the attached files, please find the output of the web UI and of the ollama server console. At line 1 of the webui output I ask a question using llama3:latest (line 3). The result is shown in lines 4-42.

At line 45 I ask the same question but using phi3:medium (line 47). The output follows and it is garbage.

 ipex-llm-ollama-server.txt
webui_output.txt

Hi @js333031, we are working on reproducing your issue. In the meantime, could you also please try the following solutions?

  1. Pull the phi-3 model by running the command ollama pull phi3.
  2. Add a prompt template in your ollama modelfile and create the ollama phi-3 model from it, as below (see the example commands after this list):
    FROM ./Phi-3-medium-4k-instruct.gguf
    TEMPLATE """<|user|>
    {{.Prompt}}<|end|>
    <|assistant|>"""
    PARAMETER stop <|end|>
    PARAMETER num_ctx 4096
    PARAMETER num_gpu 33
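
If the web UI's model creation gives trouble, the same Modelfile also works from the command line. A minimal sketch (the model name phi3-medium and the Modelfile path are placeholders):

ollama create phi3-medium -f ./Modelfile
ollama run phi3-medium "Why is the sky blue?"

If the command-line model produces clean output while the web UI does not, the problem is on the UI side rather than in the gguf model.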

I tried that via the webui GUI. I get an error when I click on the save & create button:
(screenshot of the error attached)

Hi @js333031, could you please try the following methods:

  1. In your terminal, run ollama pull phi3 to download the model.
  2. If the first solution does not work, please use the ollama modelfile in #11177 (comment) to create the ollama phi3 model from the gguf file; see that comment for more details.

You may see https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html#using-ollama-run-gguf-models for the detailed steps.

Hi @js333031, I apologize for my mistake: ollama pull phi3 will only pull the phi-3-mini model. There is indeed an abnormal output issue with phi-3-medium. You may refer to the modelfile below as a workaround to avoid the abnormal output when creating your ollama model from the gguf file.

FROM Phi-3-medium-4k-instruct-Q4_K_S.gguf

TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"""

PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|end|>"
PARAMETER num_ctx 256
PARAMETER num_gpu 33
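
To rule out the web UI, you can also query the ollama server's REST API directly once the model is created. A minimal sketch, assuming the default port 11434 and a model created under the placeholder name phi3-medium:

curl http://localhost:11434/api/generate -d '{
  "model": "phi3-medium",
  "prompt": "Why is the sky blue?",
  "stream": false
}'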

Once we resolve the abnormal output issue, we will inform you immediately.

Hi @sgwhat, I got the same problem.

I followed your suggestions and the results are the same.

I used the model Phi-3-medium-4k-instruct-Q5_K_M.gguf with your template:

FROM /llm/models/Phi-3-medium-4k-instruct-Q5_K_M.gguf

TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"""

PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|end|>"
PARAMETER num_ctx 256
PARAMETER num_gpu 33

And I also get garbage:
(screenshot of the garbage output attached)

I also saw that you added another auto-tokenizer in the recent commit (15a6205), and the one used here with phi is:
tokenizer.ggml.model str = llama

Could this be the issue?

This is my console log (i run it in the docker container)

root@ai:/llm/scripts# bash start-ollama.sh
root@ai:/llm/scripts# 2024/06/03 21:08:59 routes.go:1028: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-06-03T21:08:59.788+02:00 level=INFO source=images.go:729 msg="total blobs: 0"
time=2024-06-03T21:08:59.788+02:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
time=2024-06-03T21:08:59.788+02:00 level=INFO source=routes.go:1074 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-06-03T21:08:59.788+02:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama1794338369/runners
time=2024-06-03T21:08:59.842+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
time=2024-06-03T21:08:59.844+02:00 level=INFO source=types.go:71 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="62.6 GiB" available="2.6 GiB"

root@ai:/llm/scripts# bash start-open-webui.sh
Cannot determine model snapshot path: Cannot find an appropriate cached snapshot folder for the specified revision on the local disk and outgoing traffic has been disabled. To enable repo look-ups and downloads online, pass 'local_files_only=False' as input.
Traceback (most recent call last):
  File "/llm/open-webui/backend/apps/rag/utils.py", line 396, in get_model_path
    model_repo_path = snapshot_download(**snapshot_kwargs)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/_snapshot_download.py", line 220, in snapshot_download
    raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: Cannot find an appropriate cached snapshot folder for the specified revision on the local disk and outgoing traffic has been disabled. To enable repo look-ups and downloads online, pass 'local_files_only=False' as input.
modules.json: 100%|██████████| 349/349 [00:00<00:00, 1.39MB/s]
config_sentence_transformers.json: 100%|██████████| 116/116 [00:00<00:00, 555kB/s]
README.md: 100%|██████████| 10.7k/10.7k [00:00<00:00, 24.9MB/s]
sentence_bert_config.json: 100%|██████████| 53.0/53.0 [00:00<00:00, 181kB/s]
config.json: 100%|██████████| 612/612 [00:00<00:00, 3.19MB/s]
model.safetensors: 100%|██████████| 90.9M/90.9M [00:02<00:00, 44.9MB/s]
tokenizer_config.json: 100%|██████████| 350/350 [00:00<00:00, 1.66MB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 1.09MB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 1.53MB/s]
special_tokens_map.json: 100%|██████████| 112/112 [00:00<00:00, 565kB/s]
1_Pooling/config.json: 100%|██████████| 190/190 [00:00<00:00, 876kB/s]
INFO:     Started server process [51]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
time=2024-06-03T21:15:04.659+02:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=33 layers.real=33 memory.available="1.4 GiB" memory.required.full="9.6 GiB" memory.required.partial="7.8 GiB" memory.required.kv="50.0 MiB" memory.weights.total="9.3 GiB" memory.weights.repeating="9.2 GiB" memory.weights.nonrepeating="128.4 MiB" memory.graph.full="33.3 MiB" memory.graph.partial="33.3 MiB"
time=2024-06-03T21:15:04.660+02:00 level=INFO source=server.go:342 msg="starting llama server" cmd="/tmp/ollama1794338369/runners/cpu_avx2/ollama_llama_server --model /root/.ollama/models/blobs/sha256-5e9d850d6c899e7fdf39a19cdf6fecae225e0c5bb3d13d6f277cbda508a15f0c --ctx-size 256 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 34173"
time=2024-06-03T21:15:04.660+02:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-03T21:15:04.660+02:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-03T21:15:04.663+02:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server error"
llama_model_loader: loaded meta data with 30 key-value pairs and 243 tensors from /root/.ollama/models/blobs/sha256-5e9d850d6c899e7fdf39a19cdf6fecae225e0c5bb3d13d6f277cbda508a15f0c (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 4096
llama_model_loader: - kv   3:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv   4:                      phi3.embedding_length u32              = 5120
llama_model_loader: - kv   5:                   phi3.feed_forward_length u32              = 17920
llama_model_loader: - kv   6:                           phi3.block_count u32              = 40
llama_model_loader: - kv   7:                  phi3.attention.head_count u32              = 40
llama_model_loader: - kv   8:               phi3.attention.head_count_kv u32              = 10
llama_model_loader: - kv   9:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                  phi3.rope.dimension_count u32              = 128
llama_model_loader: - kv  11:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 17
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% for message in messages %}{% if (m...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                      quantize.imatrix.file str              = /models/Phi-3-medium-4k-instruct-GGUF...
llama_model_loader: - kv  27:                   quantize.imatrix.dataset str              = /training_data/calibration_data.txt
llama_model_loader: - kv  28:             quantize.imatrix.entries_count i32              = 160
llama_model_loader: - kv  29:              quantize.imatrix.chunks_count i32              = 234
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q5_K:  101 tensors
llama_model_loader: - type q6_K:   61 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 10
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1280
llm_load_print_meta: n_embd_v_gqa     = 1280
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 17920
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 14B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 13.96 B
llm_load_print_meta: model size       = 9.38 GiB (5.77 BPW)
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
time=2024-06-03T21:15:04.914+02:00 level=INFO source=server.go:566 msg="waiting for server to become available" status="llm server loading model"
found 4 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.26241|
| 1|     [opencl:gpu:0]|                Intel Arc A770 Graphics|    3.0|    512|    1024|   32| 16225M|       23.35.27191.42|
| 2|     [opencl:cpu:0]|          13th Gen Intel Core i5-13600K|    3.0|     20|    8192|   64| 67175M|2023.16.12.0.12_195853.xmain-hotfix|
| 3|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|     20|67108864|   64| 67175M|2023.16.12.0.12_195853.xmain-hotfix|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:512
llm_load_tensors: ggml ctx size =    0.28 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  9499.15 MiB
llm_load_tensors:        CPU buffer size =   107.64 MiB
llama_new_context_with_model: n_ctx      = 256
llama_new_context_with_model: n_batch    = 256
llama_new_context_with_model: n_ubatch   = 256
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =    50.00 MiB
llama_new_context_with_model: KV self size  =   50.00 MiB, K (f16):   25.00 MiB, V (f16):   25.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.14 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =    85.25 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =     5.25 MiB
llama_new_context_with_model: graph nodes  = 1646
llama_new_context_with_model: graph splits = 2
time=2024-06-03T21:15:16.223+02:00 level=INFO source=server.go:571 msg="llama runner started in 11.56 seconds"

Hi @flekol, you may try to set export OLLAMA_NUM_GPU=33 before starting the ollama server; this is a feasible workaround. By the way, may I take a look at your start-ollama.sh?
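
For reference, a minimal sketch of what that environment setup could look like before launching the server; the oneAPI path and the serve invocation below are assumptions based on the ipex-llm ollama quickstart, not the contents of your actual start-ollama.sh:

# enable the oneAPI/SYCL environment (path may differ inside the container)
source /opt/intel/oneapi/setvars.sh
# expose GPU memory information through Level Zero sysman
export ZES_ENABLE_SYSMAN=1
# ask ollama to offload 33 layers to the Intel GPU
export OLLAMA_NUM_GPU=33
./ollama serve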

I downloaded a different model file and used your template. Still getting garbage:

huggingface-cli download bartowski/Phi-3-medium-4k-instruct-GGUF --include "Phi-3-medium-4k-instruct-Q4_K_M.gguf" --local-dir ./

ollama create example -f ModelFile

Modelfile:

FROM Phi-3-medium-4k-instruct-Q4_K_M.gguf

TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>"""

PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|end|>"
PARAMETER num_ctx 256
PARAMETER num_gpu 33

I also tried Phi-3-medium-4k-instruct-Q4_K_S.gguf and output is garbage also.

Hi @js333031, we have fixed the abnormal output issue and it will be released tonight. You may run the command below tomorrow to install the latest version of ipex-llm[cpp] (version 2.1.0b20240605) and then initialize ollama:

pip install --pre --upgrade ipex-llm[cpp]

init-ollama
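
After the upgrade, you can verify the installed version and re-create the model before testing again. A small sketch; phi3-medium is a placeholder model name:

pip show ipex-llm                          # should report 2.1.0b20240605 or newer
ollama create phi3-medium -f ./Modelfile   # re-create the model with the updated binary
ollama run phi3-medium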

@sgwhat thanks a lot, it works for me now in the docker container!

By the way, which commit fixed this?

> Hi @js333031, we have fixed the abnormal output issue and it will be released tonight. You may run the command below tomorrow to install the latest version of ipex-llm[cpp] (version 2.1.0b20240605) and then initialize ollama:
>
> pip install --pre --upgrade ipex-llm[cpp]
>
> init-ollama

I'm able to use the model now. Thanks for the fix.