alibaba / rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

Qwen Chat CUDA OutOfMemory

xorange opened this issue · comments

RTX 4090 (24 GB), Qwen-7B-Chat.

This loads OK:

from maga_transformer.pipeline import Pipeline
from maga_transformer.model_factory import ModelFactory, ModelConfig

# load the base model together with two LoRA adapters
model_config = ModelConfig(lora_infos={
    "lora_1": conf['lora_1'],
    "lora_2": conf['lora_2'],
})
model = ModelFactory.from_huggingface(conf['base_model_dir'], model_config=model_config)
pipeline = Pipeline(model, model.tokenizer)
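
To confirm the load really works, a quick smoke test; the call style follows the usage example in the rtp-llm README, and the exact fields of the response object may differ between versions:

# sanity check: generate a few tokens with the freshly loaded model
for res in pipeline("hello", max_new_tokens=16):
    print(res)
pipeline.stop()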

But the following causes a torch.cuda.OutOfMemoryError:

# rtp_sys.conf contents:
#
# [
#     {"task_id": 1, "prompt": " <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>"}
# ]

import os

from maga_transformer.pipeline import Pipeline
from maga_transformer.model_factory import ModelFactory, ModelConfig

# MULTI_TASK_PROMPT points at a JSON file of per-task system prompts that
# rtp-llm pre-builds a KV cache for at load time (per the SystemPrompt-Tutorial)
os.environ['MULTI_TASK_PROMPT'] = './rtp_sys.conf'

model_config = ModelConfig(lora_infos={
    "lora_1": conf['lora_1'],
    "lora_2": conf['lora_2'],
})
model = ModelFactory.from_huggingface(conf['base_model_dir'], model_config=model_config)
pipeline = Pipeline(model, model.tokenizer)

It fails while loading the layer weights; the traceback ends with:

File "/data1/miniconda/xxx/rtp-llm/lib/python3.10/site-packages/maga_transformer/utils/model_weights_loader.py", line 304, in _load_layer_weight
    tensor = self._split_and_sanitize_tensor(tensor, weight).to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory.

I've tried both with and without export ENABLE_FMHA=OFF.
I'm following the SystemPrompt-Tutorial.

For the record, my requirements here are:

  1. I have two LoRAs, and within a single chat round I need to switch between them.
  2. I need to use the chat interface. Since Qwen does not ship a chat_template, I need a way to implement "make_context" myself (roughly what I sketch below).

Because of requirement 1, running python3 -m maga_transformer.start_server and sending an OpenAI-style HTTP POST request is not an option. (Or, if it is possible to switch adapters on an already running server, please tell me how.)
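
For requirement 2, what I have in mind is a hand-rolled ChatML builder, since Qwen-7B-Chat uses the ChatML format. build_chat_prompt is just my placeholder name, and passing adapter_name per pipeline call is an assumption based on the LoRA tutorial, not something I have verified:

# hypothetical helper: build a ChatML prompt the way Qwen's make_context does
def build_chat_prompt(system, history, user_query):
    # history is a list of (user_msg, assistant_msg) pairs from earlier turns
    parts = [f"<|im_start|>system\n{system}<|im_end|>\n"]
    for user_msg, assistant_msg in history:
        parts.append(f"<|im_start|>user\n{user_msg}<|im_end|>\n")
        parts.append(f"<|im_start|>assistant\n{assistant_msg}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{user_query}<|im_end|>\n<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chat_prompt("You are a helpful assistant.", [], "hello")
# adapter_name per call is an assumption from the LoRA tutorial; verify against your version
for res in pipeline(prompt, max_new_tokens=128, adapter_name="lora_1"):
    print(res)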

Hi there,
A CUDA OOM is not unexpected here; with a 7B model plus two LoRAs and a cached system prompt on a 24 GB card it can happen.
Maybe you can try int8 quantization, which saves a lot of CUDA memory (a sketch below).
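
Something along these lines; this is only a sketch, and I'm assuming weight-only int8 is switched on via INT8_MODE (or the corresponding ModelConfig option), so please double-check the quantization doc for the version you have installed:

import os

from maga_transformer.pipeline import Pipeline
from maga_transformer.model_factory import ModelFactory, ModelConfig

# assumption: int8 weight-only quantization is enabled via this environment
# variable / the matching ModelConfig option; the exact name may differ by version
os.environ['INT8_MODE'] = '1'

model_config = ModelConfig(lora_infos={
    "lora_1": conf['lora_1'],
    "lora_2": conf['lora_2'],
})
model = ModelFactory.from_huggingface(conf['base_model_dir'], model_config=model_config)
pipeline = Pipeline(model, model.tokenizer)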

I'm not sure why rtp-llm loads this model successfully but then fails once the MULTI_TASK_PROMPT (system prompt) file is provided.

I haven't even started to chat yet.