ollama / ollama

Get up and running with Llama 3, Mistral, Gemma, and other large language models.

Home Page: https://ollama.com

Import a model:latest aborted (core dumped)

Anorid opened this issue

What is the issue?

I carefully read the README documentation and tried the following:

root@autodl-container-36e51198ae-c4ed76b0:/autodl-tmp/model# ollama create example -f Modelfile
transferring model data
using existing layer sha256:8c7d76a23837d1b07ca3c3aa497d90ffafdfc2fd417b93e4e06caeeabf4f1526
using existing layer sha256:dbc2ca980bfce0b44450f42033a51513616ac71f8b5881efbaa81d8f5e9b253e
using existing layer sha256:be7c61fea675f5a89b441192e604c0fcc8806a19e235421f17dda66e5fc67b2d
writing manifest
success
root@autodl-container-36e51198ae-c4ed76b0:/autodl-tmp/model# ollama run example "What is your favourite condiment?"
Error: llama runner process has terminated: signal: aborted (core dumped)
root@autodl-container-36e51198ae-c4ed76b0:/autodl-tmp/model# nivdia-smi
bash: nivdia-smi: command not found
root@autodl-container-36e51198ae-c4ed76b0:/autodl-tmp/model# nvidia-smi
Fri May 17 10:02:03 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A40 On | 00000000:C1:00.0 Off | Off |
| 0% 25C P8 20W / 300W | 2MiB / 49140MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
root@autodl-container-36e51198ae-c4ed76b0:~/autodl-tmp/model# ollama run example
Error: llama runner process has terminated: signal: aborted (core dumped)
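
For context, a minimal Modelfile for importing a locally converted GGUF usually looks like the sketch below; the file name is a placeholder, since the actual Modelfile was not posted in this issue:

```
# Hypothetical minimal Modelfile (the actual one used here was not shared)
# FROM points at the locally converted GGUF file
FROM ./model-q4_0.gguf
# Optional TEMPLATE, SYSTEM, and PARAMETER directives can follow
```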

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

No response

Can you post the Modelfile and the logs? What was the gguf you were using?

Import from PyTorch or Safetensors
See the guide on importing models for more information.

I performed the conversion in llama.cpp using convert-hf-to-gguf and llm/llama.cpp/quantize with q4_0, as described in the "Import from PyTorch or Safetensors" section of the README.
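
The conversion step presumably looked something like the following sketch (the model path and output file names are placeholders, not the exact commands used):

```bash
# Convert the Hugging Face model to GGUF (f16), then quantize to q4_0.
# Paths are hypothetical; the exact invocation was not included in the issue.
python llm/llama.cpp/convert-hf-to-gguf.py /path/to/hf-model \
  --outtype f16 --outfile model-f16.gguf

# Quantize with the llama.cpp quantize binary vendored under llm/llama.cpp
llm/llama.cpp/quantize model-f16.gguf model-q4_0.gguf q4_0
```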

root@autodl-container-c438119a3c-80821c25:~/autodl-tmp# ollama serve
2024/05/20 11:28:20 routes.go:1008: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-05-20T11:28:20.921+08:00 level=INFO source=images.go:704 msg="total blobs: 0"
time=2024-05-20T11:28:20.921+08:00 level=INFO source=images.go:711 msg="total unused blobs removed: 0"
time=2024-05-20T11:28:20.922+08:00 level=INFO source=routes.go:1054 msg="Listening on [::]:6006 (version 0.1.38)"
time=2024-05-20T11:28:20.922+08:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2749468660/runners
time=2024-05-20T11:28:24.936+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 rocm_v60002]"
time=2024-05-20T11:28:25.117+08:00 level=INFO source=types.go:71 msg="inference compute" id=GPU-0f3aa8d5-c5ed-3fa3-1cb4-4aef2d3d8317 library=cuda compute=8.6 driver=12.2 name="NVIDIA A40" total="47.5 GiB" available="47.3 GiB"
[GIN] 2024/05/20 - 11:32:05 | 200 | 86.076µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/05/20 - 11:32:29 | 201 | 12.804363258s | 127.0.0.1 | POST "/api/blobs/sha256:1c751709783923dab2b876d5c5c2ca36d4e205cfef7d88988df45752cb91f245"
[GIN] 2024/05/20 - 11:32:43 | 200 | 14.155378431s | 127.0.0.1 | POST "/api/create"
[GIN] 2024/05/20 - 11:33:04 | 200 | 35.782µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/05/20 - 11:33:04 | 200 | 1.190285ms | 127.0.0.1 | POST "/api/show"
[GIN] 2024/05/20 - 11:33:04 | 200 | 737.579µs | 127.0.0.1 | POST "/api/show"
time=2024-05-20T11:33:06.243+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=41 memory.available="47.3 GiB" memory.required.full="9.7 GiB" memory.required.partial="9.7 GiB" memory.required.kv="1.6 GiB" memory.weights.total="7.2 GiB" memory.weights.repeating="6.6 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="307.0 MiB" memory.graph.partial="916.1 MiB"
time=2024-05-20T11:33:06.244+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=41 memory.available="47.3 GiB" memory.required.full="9.7 GiB" memory.required.partial="9.7 GiB" memory.required.kv="1.6 GiB" memory.weights.total="7.2 GiB" memory.weights.repeating="6.6 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="307.0 MiB" memory.graph.partial="916.1 MiB"
time=2024-05-20T11:33:06.244+08:00 level=INFO source=server.go:320 msg="starting llama server" cmd="/tmp/ollama2749468660/runners/cuda_v11/ollama_llama_server --model /root/autodl-tmp/model/blobs/sha256-1c751709783923dab2b876d5c5c2ca36d4e205cfef7d88988df45752cb91f245 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --parallel 1 --port 39195"
time=2024-05-20T11:33:06.245+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-05-20T11:33:06.245+08:00 level=INFO source=server.go:504 msg="waiting for llama runner to start responding"
time=2024-05-20T11:33:06.245+08:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="952d03d" tid="140637096448000" timestamp=1716175986
INFO [main] system info | n_threads=64 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140637096448000" timestamp=1716175986 total_threads=128
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="127" port="39195" tid="140637096448000" timestamp=1716175986
llama_model_loader: loaded meta data with 21 key-value pairs and 483 tensors from /root/autodl-tmp/model/blobs/sha256-1c751709783923dab2b876d5c5c2ca36d4e205cfef7d88988df45752cb91f245 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = merge5-1
llama_model_loader: - kv 2: qwen2.block_count u32 = 40
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 13696
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 40
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 201 tensors
llama_model_loader: - type q4_0: 281 tensors
llama_model_loader: - type q6_K: 1 tensors
time=2024-05-20T11:33:06.497+08:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server loading model"
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
llama_load_model_from_file: exception loading model
terminate called after throwing an instance of 'std::runtime_error'
what(): error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
time=2024-05-20T11:33:06.872+08:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server error"
time=2024-05-20T11:33:07.122+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) "
[GIN] 2024/05/20 - 11:33:07 | 500 | 2.21829574s | 127.0.0.1 | POST "/api/chat"
time=2024-05-20T11:33:12.234+08:00 level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.112074522
time=2024-05-20T11:33:12.485+08:00 level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.362608222
time=2024-05-20T11:33:12.734+08:00 level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.612062447

This is the error, along with the logs.

@pdevine Is this information enough to determine the cause of the error?

Can you include the Modelfile as well?