Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.

Home Page: https://llamafile.ai


GPU offloading doesn't seem to be working

v4u6h4n opened this issue · comments

Hey everyone, awesome project :-) I'm having fun playing around with it, but I don't think my GPU is being utilised. I can see my CPU maxing out, but I'm not seeing much change in my GPU usage, so I'm wondering what the issue is. Here's the output in the terminal:

/media/storage/Software/AI/Meta-Llama-3-70B-Instruct.Q4_0.llamafile -ngl 9999
import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: amdclang++ not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/amdclang++ does not exist
get_rocm_bin_path: note: /opt/rocm/bin/amdclang++ does not exist
get_rocm_bin_path: note: hipInfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/hipInfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/hipInfo does not exist
get_rocm_bin_path: note: rocminfo not found on $PATH
get_rocm_bin_path: note: $HIP_PATH/bin/rocminfo does not exist
get_rocm_bin_path: note: /opt/rocm/bin/rocminfo does not exist
get_amd_offload_arch_flag: warning: can't find hipInfo/rocminfo commands for AMD GPU detection
llamafile_log_command: hipcc -O3 -fPIC -shared -DNDEBUG --offload-arch=native -march=native -mtune=native -DGGML_BUILD=1 -DGGML_SHARED=1 -Wno-return-type -Wno-unused-result -DGGML_USE_HIPBLAS -DGGML_CUDA_MMV_Y=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DIGNORE4 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DIGNORE -o /home/v4u6h4n/.llamafile/ggml-rocm.so.dhsn3g /home/v4u6h4n/.llamafile/ggml-cuda.cu -lhipblas -lrocblas
hipcc: Permission denied
extract_cuda_dso: note: prebuilt binary /zip/ggml-rocm.so not found
get_nvcc_path: note: nvcc not found on $PATH
get_nvcc_path: note: $CUDA_PATH/bin/nvcc does not exist
get_nvcc_path: note: /opt/cuda/bin/nvcc does not exist
get_nvcc_path: note: /usr/local/cuda/bin/nvcc does not exist
extract_cuda_dso: note: prebuilt binary /zip/ggml-cuda.so not found
{"function":"server_params_parse","level":"WARN","line":2384,"msg":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1,"tid":"8545344","timestamp":1714335027}
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2839,"msg":"build info","tid":"8545344","timestamp":1714335027}
{"function":"server_cli","level":"INFO","line":2842,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | ","tid":"8545344","timestamp":1714335027,"total_threads":32}
llama_model_loader: loaded meta data with 22 key-value pairs and 723 tensors from Meta-Llama-3-70B-Instruct.Q4_0.gguf (version GGUF V3 (latest))

...and my system specs:

OS: Arch Linux x86_64
Kernel: 6.8.7-arch1-2
CPU: AMD Ryzen 9 7950X3D (32) @ 5.759GHz
GPU: AMD ATI Radeon RX 7900 XT/7900 XTX/7900M
GPU: AMD ATI 13:00.0 Raphael
Memory: 14430MiB / 63427MiB
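
The log above shows llamafile probing for the ROCm toolchain (amdclang++, hipInfo, rocminfo) and failing to run hipcc ("Permission denied"), so the first thing to check is whether the HIP/ROCm toolchain is actually installed and executable. A minimal sketch, assuming Arch's rocm-hip-sdk and rocminfo packages (the package names are an assumption and may differ on your setup):

# Check that the tools llamafile probes for exist and are runnable.
for tool in hipcc rocminfo amdclang++ hipInfo; do
  command -v "$tool" >/dev/null || echo "$tool not found on PATH"
done
ls -l /opt/rocm/bin/ 2>/dev/null   # llamafile also checks /opt/rocm/bin and $HIP_PATH/bin

# On Arch the HIP toolchain is packaged roughly like this (assumption):
sudo pacman -S --needed rocm-hip-sdk rocminfo

# llamafile falls back to $HIP_PATH when the tools are not on PATH.
export HIP_PATH=/opt/rocm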

Same here, Radeon Pro W5700

llava-v1.5-7b-q4.llamafile --version
llamafile v0.8.0

Doesn't seem to have, but I'm not sure that it installed properly.

I was able to make it work by changing the base image of my container to FROM nvcr.io/nvidia/pytorch:24.03-py3

That base image is gigantic (~14.6 GB), so the best option would probably be to use a Docker multi-stage build to extract nvcc and its dependencies, roughly as sketched below.
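
A minimal sketch of that multi-stage idea, assuming the official nvidia/cuda devel and runtime images (the tags, and the exact set of CUDA directories nvcc needs, are assumptions I haven't verified against llamafile):

# Multi-stage sketch: take nvcc out of a CUDA devel image and copy it into the
# much smaller runtime image.
FROM nvidia/cuda:12.3.2-devel-ubuntu22.04 AS cuda-devel

FROM nvidia/cuda:12.3.2-runtime-ubuntu22.04
RUN apt update && apt install -y wget

# nvcc lives under /usr/local/cuda in the devel image; bin/, nvvm/ and include/
# are the pieces it needs to compile llamafile's GPU module at startup.
# If the link step later complains about a missing libcublas.so, the lib64
# dev symlinks from the devel image may be needed as well.
COPY --from=cuda-devel /usr/local/cuda/bin /usr/local/cuda/bin
COPY --from=cuda-devel /usr/local/cuda/nvvm /usr/local/cuda/nvvm
COPY --from=cuda-devel /usr/local/cuda/include /usr/local/cuda/include

# llamafile checks $CUDA_PATH/bin/nvcc (see the log above).
ENV CUDA_PATH=/usr/local/cuda
ENV PATH=/usr/local/cuda/bin:${PATH}

COPY start.sh /
RUN chmod +x /start.sh
CMD /start.sh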

@fcrisciani Unfortunately I'm enough of an amateur Linux user that I don't know what that means lol, but happy you got it working ;-)

I was referring to creating a Docker image (https://docs.docker.com/engine/install/).

My Dockerfile looks like:

# NVIDIA PyTorch base image: it ships the CUDA toolkit, so nvcc is available
# for llamafile to compile its GPU module on startup.
FROM nvcr.io/nvidia/pytorch:24.03-py3

RUN apt update && apt install -y wget

# start.sh (below) downloads the llamafile and starts the server.
COPY start.sh /
RUN chmod +x /start.sh

CMD /start.sh

The start.sh file looks like:

#!/bin/bash

echo "Download llamafile..."
wget "https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile?download=true" -O /tmp/llava-v1.5-7b-q4.llamafile

echo "Start serving the llamafile"
chmod +x /tmp/llava-v1.5-7b-q4.llamafile
# -ngl 999 offloads all layers to the GPU, --gpu nvidia selects the NVIDIA
# backend, and --host 0.0.0.0 makes the server reachable from outside the container.
/tmp/llava-v1.5-7b-q4.llamafile -ngl 999 --gpu nvidia --nobrowser --host 0.0.0.0

You can:

  1. install docker
  2. create a folder with the 2 files above: Dockerfile and start.sh
  3. build the container image: docker build -t my_gpu_test .
  4. run it: docker run --rm -it --gpus=all my_gpu_test
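
If you also want to reach the server from the host rather than only from inside the container, publish the port when running it. A small sketch, assuming llamafile's default port of 8080 and its OpenAI-compatible /v1/chat/completions endpoint:

# start.sh binds 0.0.0.0 inside the container, so just map the port to the host:
docker run --rm -it --gpus=all -p 8080:8080 my_gpu_test

# The web UI is then at http://localhost:8080, and the API can be exercised with:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llava-v1.5-7b", "messages": [{"role": "user", "content": "Say hello"}]}'
# (the model name is a placeholder; the server runs whichever model it was started with)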