mudler / LocalAI

:robot: The free, Open Source OpenAI alternative. Self-hosted, community-driven and local-first. Drop-in replacement for OpenAI running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. It can generate text, audio, video, and images, and also offers voice cloning capabilities.

Home Page: https://localai.io


feature: GPU/CUDA support?

tensiondriven opened this issue · comments

Please close this if it's off-topic or ill-informed.

LocalAI seems to be focused on providing an OpenAI-compatible API for models running on the CPU (llama.cpp, ggml). I was excited about this project because I want to use my local models with projects like BabyAGI, AutoGPT, LangChain etc., which typically either only support the OpenAI API or support OpenAI first.
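(For anyone trying the same thing: most of those tools read the standard OpenAI environment variables, so pointing them at a LocalAI instance is often just a matter of overriding the base URL. A minimal sketch, assuming the tool honors these variables and LocalAI is listening on its default port 8080:)

# Sketch only: these are the variables commonly read by OpenAI client libraries;
# whether a given tool (LangChain, AutoGPT, ...) honors them depends on its version.
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=sk-anything   # dummy value; LocalAI does not check the key by default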

I know it would add a lot of work to support every model under the sun on CPU, CUDA, ROCm, and Triton, so I'm not proposing that, but it seems that leaving CUDA off the table really limits this project's usability.

Am I simply wrong, and will typical pt / safetensors models work fine with LocalAI, or is this a valid concern?

When I read about LocalAI on GitHub, I imagined this project was more of a "dumb adapter": an HTTP server that would route requests to models being run inside projects like text-generation-webui or others. But I see it actually does the work of standing up the models itself, which is impressive.

Perhaps (either in this project or another) it would be useful to provide a project that presents as an HTTP API / CLI and has a simple plugin architecture allowing multiple models with different backends/requirements to interface with it. That way, this project could support a variety of models without having to suffer the integration and maintenance headaches that projects like text-generation-webui are taking on.

Plugins are hard to implement in Go, so the server will probably continue to include the different backends for a while. That said, GPU support in the backends isn't impossible, but as you said, it will probably be quite specific to certain hardware.

In general, if we're looking at the new Apple hardware, the separate-memory issue that forces you to decide whether you want to use the GPU or the CPU(s) will probably be a thing of the past pretty soon. Unified memory is becoming more and more common, and this server is supposed to work on consumer hardware. Running on TPUs is probably best done in the cloud, given current prices :)

All of this is just my impression, though.

Thanks @tensiondriven !

This is a good question - it pretty much depends on our current backend, which is ggml. It seems there has been more movement recently (ggerganov/llama.cpp#915) and part of the computation can be offloaded to the GPU, so this will likely land here as well.

Regarding the architectural approach - good point, but I think there are already good projects like https://github.com/hyperonym/basaran which are more oriented towards running on GPU, so there could be an overlap. However, I like the idea, and I'd be open to supporting external plugins to delegate inference.

This is now possible to wire up with llama.cpp - technically it should just be a matter of exposing options in our Makefile and maybe preparing NVIDIA images. However, this is low priority here, as I don't have an NVIDIA GPU and can't test this at all - happy if someone wants to jump in and take a stab at this one.

I'll start by enabling it in the llama backend first; I'll try to take a stab at it next week.

GPU acceleration is now available, and Metal support too. Full GPU offloading is being added to llama.cpp in ggerganov/llama.cpp#1827, so as soon as it gets merged I will follow up here as well. I'd close this card for now, as all the pieces are already in place. We also have specific CUDA container images ready to use.
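For reference, a rough sketch of what the CUDA path looks like (assuming a working CUDA toolkit for local builds and the NVIDIA container runtime for Docker; check the current docs/Makefile for the exact flags and image tags):

# local build with the cuBLAS-accelerated llama.cpp backend
make clean && make BUILD_TYPE=cublas build

# or run a CUDA container image, exposing the GPU to the container
# (<cuda-tag> is a placeholder - pick the CUDA image tag matching your driver)
docker run --gpus all -p 8080:8080 -v $PWD/models:/models \
  quay.io/go-skynet/local-ai:<cuda-tag> --models-path /models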


Hello,
It seems Metal/the GPU is still not being used at all on my Mac M1 with BUILD_TYPE=metal:

After building LocalAI on my Mac M1 from the master branch:

make clean && make BUILD_TYPE=metal build

Build is successful with local-ai generated.

models/gpt-3.5-turbo.yaml:

name: gpt-3.5-turbo
parameters:
  # this is the model downloaded from huggingface:
  model: Chinese-Llama-2-7b.ggmlv3.q4_0.bin 
  top_k: 80
  top_p: 0.9
  temperature: 0.1
context_size: 1024

Then run:

./local-ai --models-path ./models/

And send a request to http://localhost:8080/v1/chat/completions:

{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "How's the weather today?"}],
    "temperature": 0.9 
}

You can see it's not using the GPU:

Screenshot 2023-08-24 at 18 07 54


I was about to say this. I built with make BUILD_TYPE=clblas build and tried both gpt4all-j and vicuna, but both ran on the CPU instead of my AMD GPU.


Do you have a solution for that? I'm experiencing the same situation: CUDA usage always stays at 0% when I call chat from the AI.
image
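For anyone hitting the CPU-only behaviour above: as far as I understand, GPU/Metal offload is opt-in per model, so the model YAML also needs the offload options set, and the config shown in this thread has none. A sketch of what that might look like (key names as I understand them from the model-configuration docs; verify against the current version):

name: gpt-3.5-turbo
parameters:
  model: Chinese-Llama-2-7b.ggmlv3.q4_0.bin
context_size: 1024
f16: true        # Metal builds typically expect this to be enabled
gpu_layers: 35   # number of layers to offload; pick a value that fits your VRAM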