cudaMalloc failed: out of memory with TinyLlama-1.1B
Lathanao opened this issue
I am trying to get TinyLlama working on the GPU with:
./TinyLlama-1.1B-Chat-v1.0.F32.llamafile -ngl 9999
But it seems it cannot allocate 66.50 MiB of memory on my card, even right after booting the machine without any prior use of the GPU.
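(For reference, free VRAM right before launching can be confirmed with the nvidia-smi utility that ships with the NVIDIA driver; a minimal check, for example:)
nvidia-smi --query-gpu=memory.used,memory.total --format=csv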
Here is the error:
[...]
link_cuda_dso: note: dynamically linking /home/yo/.llamafile/ggml-cuda.so
ggml_cuda_link: welcome to CUDA SDK with cuBLAS
link_cuda_dso: GPU support loaded
[...]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors: CPU buffer size = 250.00 MiB
llm_load_tensors: CUDA0 buffer size = 3946.35 MiB
..........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 11.00 MiB
llama_new_context_with_model: KV self size = 11.00 MiB, K (f16): 5.50 MiB, V (f16): 5.50 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 66.50 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 66.50 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 69730304
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model 'TinyLlama-1.1B-Chat-v1.0.F32.gguf'
{"function":"load_model","level":"ERR","line":443,"model":"TinyLlama-1.1B-Chat-v1.0.F32.gguf","msg":"unable to load model","tid":"8545344","timestamp":1714117560}
I have this version of CUDA installed:
Version : 12.3.2-1
Description : NVIDIA's GPU programming toolkit
Architecture : x86_64
URL : https://developer.nvidia.com/cuda-zone
Licenses : LicenseRef-NVIDIA-CUDA
Groups : None
Provides : cuda-toolkit cuda-sdk libcudart.so=12-64 libcublas.so=12-64 libcusolver.so=11-64 libcusparse.so=12-64
Here are the specs of my machine.
System:
Kernel: 6.6.26-1-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 13.2.1
Desktop: GNOME v: 45.4 tk: GTK v: 3.24.41 Distro: Manjaro
base: Arch Linux
Machine:
Type: Laptop System: HP product: HP Pavilion Gaming Laptop 15-cx0xxx
Memory:
System RAM: total: 32 GiB available: 31.24 GiB used: 4.16 GiB (13.3%)
CPU:
Info: model: Intel Core i7-8750H bits: 64 type: MT MCP arch: Coffee Lake
gen: core 8 level: v3
Graphics:
Device-2: NVIDIA GP107M [GeForce GTX 1050 Ti Mobile]
vendor: Hewlett-Packard driver: nvidia v: 550.67
alternate: nouveau,nvidia_drm non-free: 545.xx+ status: current (as of
2024-04; EOL~2026-12-xx) arch: Pascal code: GP10x process: TSMC 16nm
built: 2016-2021 pcie: gen: 1 speed: 2.5 GT/s lanes: 16 link-max: gen: 3
speed: 8 GT/s bus-ID: 01:00.0 chip-ID: 10de:1c8c class-ID: 0300
Is there a way to solve that?
Try a smaller version of TinyLlama, Q8 instead of F32: TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile
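For context, the GTX 1050 Ti Mobile typically carries 4 GiB of VRAM, and the log above shows the F32 weights alone taking 3946.35 MiB on CUDA0, so the extra 66.50 MiB compute buffer no longer fits. A rough sketch of the two workarounds, assuming the Q8 llamafile has already been downloaded and made executable (the layer count of 15 below is only an illustrative guess, not a tuned value):

# Option 1: smaller quantization, offload everything
./TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile -ngl 9999

# Option 2: keep F32 but offload only some of the 22 repeating layers,
# leaving headroom for the KV cache and compute buffers
./TinyLlama-1.1B-Chat-v1.0.F32.llamafile -ngl 15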
Can you try llamafile-0.8.1 which was just released and tell me if it works?
Mea culpa: above, I did get a model with a lower quantization format working.
But now I am not able to run the file again without errors.
So I downloaded several models:
-Meta-Llama-3-8B-Instruct.F16.llamafile -> doesn't load
-Meta-Llama-3-8B-Instruct.Q2_K.llamafile -> SIGSEGV
-Model/Meta-Llama-3-8B-Instruct.Q8_0.llamafile -> doesn't load
-Model/Phi-3-mini-4k-instruct.Q8_0.llamafile -> doesn't load
-Model/TinyLlama-1.1B-Chat-v1.0.F16.llamafile -> SIGSEGV
-Model/TinyLlama-1.1B-Chat-v1.0.F32.llamafile -> doesn't load
-Model/TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile -> SIGSEGV
I rebooted my machine and ran the tests again. The model that was working for me this morning (Model/TinyLlama-1.1B-Chat-v1.0.F16.llamafile) now hits SIGSEGV every time.
There is no way to make it work again.
The SIGSEGV issue has been reported in #378.