alibaba / rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

Qwen Chat CUDA OutOfMemory

xorange opened this issue · comments

RTX 4090 (24 GB), Qwen-7B-Chat.

This loads OK:

from maga_transformer.pipeline import Pipeline
from maga_transformer.model_factory import ModelFactory, ModelConfig

# load the base model together with two LoRA adapters
model_config = ModelConfig(lora_infos={
    "lora_1": conf['lora_1'],
    "lora_2": conf['lora_2'],
})
model = ModelFactory.from_huggingface(conf['base_model_dir'], model_config=model_config)
pipeline = Pipeline(model, model.tokenizer)
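
To confirm the load really works, a quick smoke test; the call style follows the usage example in the rtp-llm README, and the exact fields of the response object may differ between versions:

# sanity check: generate a few tokens with the freshly loaded model
for res in pipeline("hello", max_new_tokens=16):
    print(res)
pipeline.stop()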

But the following causes a torch.cuda.OutOfMemoryError:

# rtp_sys.conf contents:
#
# [
#     {"task_id": 1, "prompt": " <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>"}
# ]

import os

from maga_transformer.pipeline import Pipeline
from maga_transformer.model_factory import ModelFactory, ModelConfig

# MULTI_TASK_PROMPT points at a JSON file of per-task system prompts that
# rtp-llm pre-builds a KV cache for at load time (per the SystemPrompt-Tutorial)
os.environ['MULTI_TASK_PROMPT'] = './rtp_sys.conf'

model_config = ModelConfig(lora_infos={
    "lora_1": conf['lora_1'],
    "lora_2": conf['lora_2'],
})
model = ModelFactory.from_huggingface(conf['base_model_dir'], model_config=model_config)
pipeline = Pipeline(model, model.tokenizer)

It fails while loading the layer weights; the traceback ends with:

File "/data1/miniconda/xxx/rtp-llm/lib/python3.10/site-packages/maga_transformer/utils/model_weights_loader.py", line 304, in _load_layer_weight
    tensor = self._split_and_sanitize_tensor(tensor, weight).to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory.

I've tried both with and without export ENABLE_FMHA=OFF.
I'm following the SystemPrompt-Tutorial.

For the record, my requirements here are:

  1. I have two LoRAs, and within a single chat round I need to switch between them.
  2. I need to use the chat interface. Since Qwen does not ship a chat_template, I need a way to implement "make_context" myself (roughly what I sketch below).

Because of requirement 1, running python3 -m maga_transformer.start_server and sending an OpenAI-style HTTP POST request is not an option. (Or, if it is possible to switch adapters on an already running server, please tell me how.)
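
For requirement 2, what I have in mind is a hand-rolled ChatML builder, since Qwen-7B-Chat uses the ChatML format. build_chat_prompt is just my placeholder name, and passing adapter_name per pipeline call is an assumption based on the LoRA tutorial, not something I have verified:

# hypothetical helper: build a ChatML prompt the way Qwen's make_context does
def build_chat_prompt(system, history, user_query):
    # history is a list of (user_msg, assistant_msg) pairs from earlier turns
    parts = [f"<|im_start|>system\n{system}<|im_end|>\n"]
    for user_msg, assistant_msg in history:
        parts.append(f"<|im_start|>user\n{user_msg}<|im_end|>\n")
        parts.append(f"<|im_start|>assistant\n{assistant_msg}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{user_query}<|im_end|>\n<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chat_prompt("You are a helpful assistant.", [], "hello")
# adapter_name per call is an assumption from the LoRA tutorial; verify against your version
for res in pipeline(prompt, max_new_tokens=128, adapter_name="lora_1"):
    print(res)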

Hi there,
A CUDA OOM is not unexpected here; with a 7B model plus two LoRAs and a cached system prompt on a 24 GB card it can happen.
Maybe you can try int8 quantization, which saves a lot of CUDA memory (a sketch below).
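
Something along these lines; this is only a sketch, and I'm assuming weight-only int8 is switched on via INT8_MODE (or the corresponding ModelConfig option), so please double-check the quantization doc for the version you have installed:

import os

from maga_transformer.pipeline import Pipeline
from maga_transformer.model_factory import ModelFactory, ModelConfig

# assumption: int8 weight-only quantization is enabled via this environment
# variable / the matching ModelConfig option; the exact name may differ by version
os.environ['INT8_MODE'] = '1'

model_config = ModelConfig(lora_infos={
    "lora_1": conf['lora_1'],
    "lora_2": conf['lora_2'],
})
model = ModelFactory.from_huggingface(conf['base_model_dir'], model_config=model_config)
pipeline = Pipeline(model, model.tokenizer)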

I'm not sure why rtp-llm loads this model successfully but then fails once the MULTI_TASK_PROMPT (system prompt) file is provided.

I haven't even started to chat yet.