intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

starcoder2 use times out

js333031 opened this issue

As the title says, using starcoder2 times out or appears stuck.

Logs attached.
starcoder2_timout.txt

Step 1.

Start the ollama server with the command below:

export ONEAPI_DEVICE_SELECTOR=level_zero:1
./ollama serve

If this step returns a libmkl.so-related error, please go to step 2.

Step 2.

Set the environment variables as below before starting the ollama server:

export LD_LIBRARY_PATH=/opt/intel/oneapi/mkl/your_oneapi_version/lib:/opt/intel/oneapi/compiler/your_oneapi_version/lib
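
(The exports above are Linux-style. On Windows, a rough equivalent, assuming the default oneAPI install location, is to activate the oneAPI environment via setvars.bat before starting the server; this is only a sketch and has not been verified on this exact setup:)

:: hypothetical Windows equivalent of Step 2 (assumes the default oneAPI install path)
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
ollama serve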

This is Windows 11, but in any case: the first set of logs below shows the setup and initialization of the service. The second set of logs is the output once the WebUI is used to ask the starcoder2 model to write C++ hello-world code. There is no further console output in the second set of logs while the WebUI is waiting for the result.

(llm-cpp) D:\vrt>set ONEAPI_DEVICE_SELECTOR=level_zero:1
(llm-cpp) D:\vrt\llama-cpp>set OLLAMA_NUM_GPU=999
(llm-cpp) D:\vrt\llama-cpp>set no_proxy=localhost,127.0.0.1
(llm-cpp) D:\vrt\llama-cpp>set ZES_ENABLE_SYSMAN=1
(llm-cpp) D:\vrt\llama-cpp>set SYCL_CACHE_PERSISTENT=1
(llm-cpp) D:\vrt\llama-cpp>
(llm-cpp) D:\vrt\llama-cpp>ollama serve
2024/06/06 06:48:22 routes.go:999: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR:D:\\vrt\\llama-cpp\\dist\\windows-amd64\\ollama_runners OLLAMA_TMPDIR:]"
time=2024-06-06T06:48:22.215-04:00 level=INFO source=images.go:697 msg="total blobs: 27"
time=2024-06-06T06:48:22.255-04:00 level=INFO source=images.go:704 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-06-06T06:48:22.258-04:00 level=INFO source=routes.go:1044 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-06-06T06:48:22.259-04:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2]"
time=2024-06-06T06:48:22.259-04:00 level=INFO source=gpu.go:122 msg="Detecting GPUs"
time=2024-06-06T06:48:22.276-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"

[GIN] 2024/06/06 - 06:51:43 | 200 |            0s |       127.0.0.1 | GET      "/api/version"
[GIN] 2024/06/06 - 06:51:56 | 200 |     44.4295ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2024/06/06 - 06:51:58 | 200 |       995.4µs |       127.0.0.1 | GET      "/api/version"
[GIN] 2024/06/06 - 06:51:58 | 200 |      4.1466ms |       127.0.0.1 | GET      "/api/tags"
time=2024-06-06T06:52:18.117-04:00 level=INFO source=gpu.go:122 msg="Detecting GPUs"
time=2024-06-06T06:52:18.135-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-06-06T06:52:18.897-04:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-06-06T06:52:18.916-04:00 level=INFO source=server.go:327 msg="starting llama server" cmd="D:\\vrt\\llama-cpp\\dist\\windows-amd64\\ollama_runners\\cpu_avx2\\ollama_llama_server.exe --model C:\\Users\\user\\.ollama\\models\\blobs\\sha256-28bfdfaeba9f51611c00ed322ba684ce6db076756dbc46643f98a8a748c5199e --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 999 --parallel 1 --port 59588"
time=2024-06-06T06:52:18.924-04:00 level=INFO source=sched.go:326 msg="loaded runners" count=1
time=2024-06-06T06:52:18.924-04:00 level=INFO source=server.go:495 msg="waiting for llama runner to start responding"
{"function":"server_params_parse","level":"INFO","line":2604,"msg":"logging to file is disabled.","tid":"11668","timestamp":1717671138}
{"build":1,"commit":"baa5868","function":"wmain","level":"INFO","line":2821,"msg":"build info","tid":"11668","timestamp":1717671138}
{"function":"wmain","level":"INFO","line":2828,"msg":"system info","n_threads":10,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"11668","timestamp":1717671138,"total_threads":20}
llama_model_loader: loaded meta data with 19 key-value pairs and 483 tensors from C:\Users\user\.ollama\models\blobs\sha256-28bfdfaeba9f51611c00ed322ba684ce6db076756dbc46643f98a8a748c5199e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = starcoder2
llama_model_loader: - kv   1:                               general.name str              = starcoder2-3b
llama_model_loader: - kv   2:                     starcoder2.block_count u32              = 30
llama_model_loader: - kv   3:                  starcoder2.context_length u32              = 16384
llama_model_loader: - kv   4:                starcoder2.embedding_length u32              = 3072
llama_model_loader: - kv   5:             starcoder2.feed_forward_length u32              = 12288
llama_model_loader: - kv   6:            starcoder2.attention.head_count u32              = 24
llama_model_loader: - kv   7:         starcoder2.attention.head_count_kv u32              = 2
llama_model_loader: - kv   8:                  starcoder2.rope.freq_base f32              = 999999.437500
llama_model_loader: - kv   9:    starcoder2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,49152]   = ["<|endoftext|>", "<fim_prefix>", "<f...
llama_model_loader: - kv  13:                  tokenizer.ggml.token_type arr[i32,49152]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  14:                      tokenizer.ggml.merges arr[str,48872]   = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 0
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  302 tensors
llama_model_loader: - type q4_0:  181 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: special tokens definition check successful ( 38/49152 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = starcoder2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 49152
llm_load_print_meta: n_merges         = 48872
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_layer          = 30
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 12
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 999999.4
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 3.03 B
llm_load_print_meta: model size       = 1.59 GiB (4.51 BPW)
llm_load_print_meta: general.name     = starcoder2-3b
llm_load_print_meta: BOS token        = 0 '<|endoftext|>'
llm_load_print_meta: EOS token        = 0 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<|endoftext|>'
llm_load_print_meta: LF token         = 164 'Ä'
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|               Intel Arc A770M Graphics|    1.3|    512|    1024|   32| 16704M|            1.3.28902|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:512
llm_load_tensors: ggml ctx size =    0.44 MiB
llm_load_tensors: offloading 30 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 31/31 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  1629.01 MiB
llm_load_tensors:        CPU buffer size =    81.00 MiB
..............................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 999999.4
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =    60.00 MiB
llama_new_context_with_model: KV self size  =   60.00 MiB, K (f16):   30.00 MiB, V (f16):   30.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =     0.20 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   124.00 MiB
[1717671143] warming up the model with an empty run
llama_new_context_with_model:  SYCL_Host compute buffer size =    10.01 MiB
llama_new_context_with_model: graph nodes  = 1177
llama_new_context_with_model: graph splits = 2

It seems there is an error when running Starcoder2. We are working on resolving this issue.

Hi @js333031, we have fixed this issue. You can install our latest version of ollama tomorrow via pip install --pre --upgrade ipex-llm[cpp].
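
For reference, a minimal upgrade-and-relaunch sketch on Windows, assuming the init-ollama.bat helper described in the ipex-llm documentation (not shown in this thread):

:: upgrade the ipex-llm llama.cpp/ollama backend
pip install --pre --upgrade ipex-llm[cpp]
:: re-create the ollama binary links in the working directory (assumed helper from the ipex-llm docs)
init-ollama.bat
:: restart the server
ollama serve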

With the new update, it does not get stuck, but it looks like only the CPU is used. Should the GPU (A770M) work?

(llm-cpp) D:\vrt\open-webui-main\backend>ollama serve
2024/06/12 11:54:32 routes.go:1007: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST: OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS: OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR:C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama_runners OLLAMA_TMPDIR:]"
time=2024-06-12T11:54:32.476-04:00 level=INFO source=images.go:729 msg="total blobs: 32"
time=2024-06-12T11:54:32.535-04:00 level=INFO source=images.go:736 msg="total unused blobs removed: 0"
time=2024-06-12T11:54:32.537-04:00 level=INFO source=routes.go:1053 msg="Listening on 127.0.0.1:11434 (version 0.1.41)"
time=2024-06-12T11:54:32.538-04:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cuda_v11.3 rocm_v5.7 cpu]"
time=2024-06-12T11:54:32.599-04:00 level=INFO source=types.go:71 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="15.6 GiB" available="7.4 GiB"
[GIN] 2024/06/12 - 11:55:24 | 200 |     76.6328ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2024/06/12 - 11:55:26 | 200 |      5.5977ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2024/06/12 - 11:55:30 | 200 |     15.4795ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2024/06/12 - 11:55:39 | 200 |     10.4935ms |       127.0.0.1 | GET      "/api/version"
[GIN] 2024/06/12 - 11:55:42 | 200 |       245.9µs |       127.0.0.1 | GET      "/api/version"
time=2024-06-12T11:56:02.837-04:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=31 memory.available="6.5 GiB" memory.required.full="1.8 GiB" memory.required.partial="1.8 GiB" memory.required.kv="60.0 MiB" memory.weights.total="1.6 GiB" memory.weights.repeating="1.5 GiB" memory.weights.nonrepeating="81.0 MiB" memory.graph.full="120.0 MiB" memory.graph.partial="120.0 MiB"
time=2024-06-12T11:56:02.855-04:00 level=INFO source=server.go:341 msg="starting llama server" cmd="C:\\Users\\user\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cpu_avx2\\ollama_llama_server.exe --model C:\\Users\\user\\.ollama\\models\\blobs\\sha256-28bfdfaeba9f51611c00ed322ba684ce6db076756dbc46643f98a8a748c5199e --ctx-size 2048 --batch-size 512 --embedding --log-disable --parallel 1 --port 53585"
time=2024-06-12T11:56:02.907-04:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-12T11:56:02.907-04:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-12T11:56:02.909-04:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3051 commit="5921b8f0" tid="8004" timestamp=1718207762
INFO [wmain] system info | n_threads=10 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="8004" timestamp=1718207762 total_threads=20
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="53585" tid="8004" timestamp=1718207762
llama_model_loader: loaded meta data with 19 key-value pairs and 483 tensors from C:\Users\user\.ollama\models\blobs\sha256-28bfdfaeba9f51611c00ed322ba684ce6db076756dbc46643f98a8a748c5199e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = starcoder2
llama_model_loader: - kv   1:                               general.name str              = starcoder2-3b
llama_model_loader: - kv   2:                     starcoder2.block_count u32              = 30
llama_model_loader: - kv   3:                  starcoder2.context_length u32              = 16384
llama_model_loader: - kv   4:                starcoder2.embedding_length u32              = 3072
llama_model_loader: - kv   5:             starcoder2.feed_forward_length u32              = 12288
llama_model_loader: - kv   6:            starcoder2.attention.head_count u32              = 24
llama_model_loader: - kv   7:         starcoder2.attention.head_count_kv u32              = 2
llama_model_loader: - kv   8:                  starcoder2.rope.freq_base f32              = 999999.437500
llama_model_loader: - kv   9:    starcoder2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,49152]   = ["<|endoftext|>", "<fim_prefix>", "<f...
llama_model_loader: - kv  13:                  tokenizer.ggml.token_type arr[i32,49152]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  14:                      tokenizer.ggml.merges arr[str,48872]   = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 0
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  302 tensors
llama_model_loader: - type q4_0:  181 tensors
llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default'
llm_load_vocab: special tokens cache size = 38
time=2024-06-12T11:56:03.173-04:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: token to piece cache size = 0.5651 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = starcoder2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 49152
llm_load_print_meta: n_merges         = 48872
llm_load_print_meta: n_ctx_train      = 16384
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 24
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_layer          = 30
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 12
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 999999.4
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 16384
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 3.03 B
llm_load_print_meta: model size       = 1.59 GiB (4.51 BPW)
llm_load_print_meta: general.name     = starcoder2-3b
llm_load_print_meta: BOS token        = 0 '<|endoftext|>'
llm_load_print_meta: EOS token        = 0 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<|endoftext|>'
llm_load_print_meta: LF token         = 164 'Ä'
llm_load_print_meta: EOT token        = 0 '<|endoftext|>'
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors:        CPU buffer size =  1629.01 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 999999.4
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    60.00 MiB
llama_new_context_with_model: KV self size  =   60.00 MiB, K (f16):   30.00 MiB, V (f16):   30.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.20 MiB
llama_new_context_with_model:        CPU compute buffer size =   124.01 MiB
llama_new_context_with_model: graph nodes  = 1147
llama_new_context_with_model: graph splits = 1
INFO [wmain] model loaded | tid="8004" timestamp=1718207764
time=2024-06-12T11:56:04.766-04:00 level=INFO source=server.go:572 msg="llama runner started in 1.86 seconds"
[GIN] 2024/06/12 - 11:57:55 | 200 |         1m53s |       127.0.0.1 | POST     "/api/chat"

With the new update, it does not get stuck, but it looks like only the CPU is used. Should the GPU (A770M) work?

Yes, the A770M should work, but please install the latest update of ipex-llm[cpp] released today (version 2.1.0b20240612).
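
After upgrading, the GPU-related environment variables from the first session still need to be set before ollama serve. A sketch based on the variables shown earlier in this thread (the level_zero device index may differ per machine):

:: offload all layers to the Intel GPU
set OLLAMA_NUM_GPU=999
:: select the Arc GPU via Level Zero (index may differ)
set ONEAPI_DEVICE_SELECTOR=level_zero:1
set ZES_ENABLE_SYSMAN=1
set SYCL_CACHE_PERSISTENT=1
set no_proxy=localhost,127.0.0.1
ollama serve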

I updated and tried starcoder2 again. Inference happens on the GPU now, but the query results start out fine and soon degrade into garbage output.

(Screenshot of the garbled model output attached.)

Hi @js333031, we found that the starcoder2 model produces garbage output because the official ollama template for it is poorly supported. We haven't found a better template in the community so far. Is it possible for you to run codeqwen instead?
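
For example, a quick way to try codeqwen with the standard ollama CLI (model tag as published in the ollama library):

:: pull the codeqwen model
ollama pull codeqwen
:: quick smoke test from the command line
ollama run codeqwen "Write a C++ hello world program."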