lhenault / simpleAI

An easy way to host your own AI API and expose alternative models, while being compatible with "open" AI clients.

Home Page: https://pypi.org/project/simple-ai-server/

Issue with Loading MPT-7B on Titan-X GPU - Potential Device Map Solution

danielvasic opened this issue

Hey @lhenault ,

First off, kudos on the Python package - it's been a real game-changer for my projects.

I've hit a bit of a snag, though. I'm trying to load the MPT-7B model onto my Titan X GPU, but it seems to be loading into RAM instead. Oddly enough, I managed to get the Alpaca LoRA 7B model loaded, but it only works on the edit route; any other route raises an 'index out of bounds' exception. I'm guessing that's because Alpaca only supports the Instruct API, not the Chat or Completion APIs. So I tried switching to MPT-7B-Chat, but no dice: I can't get the model into memory.

I came across a StackOverflow discussion that suggested using a device map when loading the model. Sounds like it might do the trick, but I wanted to run it by you first.

Do you think this could solve my problem? Any advice would be super appreciated!

Best,

Hey @danielvasic, thanks for the kind words, I'm glad this project is useful to you!

From what I've seen, the MPT-7B model uses about 16GB of VRAM on the GPU (plus a few extra GB for the inputs), so a Titan X wouldn't be enough (correct me if I'm wrong, but those "only" have 12GB).

In your original comment you mention having 2 Titan X cards available, so using them together through `device_map="auto"` should give you 24GB of VRAM, enough to load and use that model, if that works as suggested by your link. I'd give it a try if I were you; let me know how it goes.

BTW, the current models.py loads the model on the first available GPU at L80:

    ).to(device)

I believe you should comment out and/or remove this part to use all of them.
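For reference, here is a minimal sketch of what that could look like, assuming the model is loaded through transformers' AutoModelForCausalLM (the checkpoint name is illustrative, and this is not the exact simpleAI code):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "mosaicml/mpt-7b-chat"  # illustrative checkpoint

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # device_map="auto" (backed by accelerate) shards the weights across all
    # visible GPUs instead of placing the full model on a single device, so
    # the explicit `.to(device)` call should be dropped.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        trust_remote_code=True,  # MPT ships custom modeling code on the Hub
    )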

Dear @lhenault,

Thanks for your reply and your valuable time.

Actually I have two GPUs: a Titan X (with 12GB of VRAM, as you suggested) and a Quadro K2200 (with 4GB of VRAM), which together would be just enough to load the model. I have tried the `device_map="auto"` option, but it seems it is not supported for the MPT model yet :-(

  File "/home/user/.local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2685, in from_pretrained
    raise ValueError(f"{model.__class__.__name__} does not support `device_map='{device_map}'` yet.")
ValueError: MPTForCausalLM does not support `device_map='auto'` yet.

On another note, how can I use llama-7B-lora with the openai Python API? I have tried this example, but I get a 500 server error:

    import openai

    # Put anything you want in `api_key`
    openai.api_key = 'Free the models'

    # Point to your own URL
    openai.api_base = "http://127.0.0.1:8080"

    # Do your usual things, for instance a completion query:
    print(openai.Model.list())
    completion = openai.Completion.create(model="llama-7B-lora", prompt="Hello everyone this is")

So for the Alpaca model you're mentioning: it's an instruct model, so you should use Edits rather than Completions with it. Also make sure that the model is correctly defined in your models.toml file and that you're using the correct name.

Dear @lhenault,

Certainly not a question for here, but could I get some instructions on how to use Edits with the openai Python API? The model is loaded and working fine with a cURL request, and I get the response, but I cannot find openai.Edit, just openai.Completion and openai.ChatCompletion.

I have to admit I haven't tried this one through the official client, and examples are definitely lacking, but there is an Edit interface in it.
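For what it's worth, here is a rough sketch using the legacy (pre-1.0) openai client, where the class is the singular `openai.Edit` (the instruction text here is just an example):

    import openai

    openai.api_key = 'Free the models'
    openai.api_base = "http://127.0.0.1:8080"

    # Edits take an instruction and, optionally, an input to transform.
    result = openai.Edit.create(
        model="llama-7B-lora",  # the model name from the snippet above
        instruction="Say hello to everyone.",
        input="",
    )
    print(result.choices[0].text)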

Thanks @lhenault,

Thanks very much, it turned out I had tried `edits`, not `edit`. On a side note, if anyone wants to use the ChatCompletion API, the OpenAssistant example described in your blog post works fine. I only had to define the offload_folder parameter and create the directory for it.
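For anyone following along, here is a minimal sketch of that workaround, assuming the model is loaded through transformers' from_pretrained (the checkpoint name and folder name are illustrative, not taken from the blog post):

    import os
    from transformers import AutoModelForCausalLM

    # Create the on-disk offload directory before loading the model.
    os.makedirs("offload", exist_ok=True)

    model = AutoModelForCausalLM.from_pretrained(
        "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",  # illustrative checkpoint
        device_map="auto",
        offload_folder="offload",  # weights that fit in neither GPU nor CPU memory spill here
    )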