lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.


[BUG] RuntimeError: NPU out of memory. Tried to allocate 268.00 MiB

WangxuP opened this issue · comments

python3 -m fastchat.serve.cli --model-path /home/models/Qwen1.5-32B-Chat --device npu --gpus 0,1,2,3

(fast_chat) [root@localhost ~]# python3 -m fastchat.serve.cli --model-path /home/models/Qwen1.5-32B-Chat --device npu --gpus 0,1,2,3
/root/miniconda3/envs/fast_chat/lib/python3.8/site-packages/torch_npu/dynamo/__init__.py:18: UserWarning: Register eager implementation for the 'npu' backend of dynamo, as torch_npu was not compiled with torchair.
  warnings.warn(
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:08<00:00,  1.98it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[W OptionsManager.cpp:64] Warning: The environment variable ACL_DUMP_DATA has been deprecated, please use torch_npu.npu.init_dump() instead (function operator())
Traceback (most recent call last):
  File "/root/miniconda3/envs/fast_chat/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/fast_chat/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/fast_chat/lib/python3.8/site-packages/fastchat/serve/cli.py", line 305, in <module>
    main(args)
  File "/root/miniconda3/envs/fast_chat/lib/python3.8/site-packages/fastchat/serve/cli.py", line 228, in main
    chat_loop(
  File "/root/miniconda3/envs/fast_chat/lib/python3.8/site-packages/fastchat/serve/inference.py", line 361, in chat_loop
    model, tokenizer = load_model(
  File "/root/miniconda3/envs/fast_chat/lib/python3.8/site-packages/fastchat/model/model_adapter.py", line 367, in load_model
    model.to(device)
  File "/root/miniconda3/envs/fast_chat/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2597, in to
    return super().to(*args, **kwargs)
  File "/root/miniconda3/envs/fast_chat/lib/python3.8/site-packages/torch_npu/utils/module.py", line 68, in to
    return self._apply(convert)
  File "/root/miniconda3/envs/fast_chat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/root/miniconda3/envs/fast_chat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/root/miniconda3/envs/fast_chat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/root/miniconda3/envs/fast_chat/lib/python3.8/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/root/miniconda3/envs/fast_chat/lib/python3.8/site-packages/torch_npu/utils/module.py", line 66, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: NPU out of memory. Tried to allocate 268.00 MiB (NPU 0; 60.97 GiB total capacity; 59.94 GiB already allocated; 59.94 GiB current active; 18.66 MiB free; 60.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
(fast_chat) [root@localhost ~]#

I am running inference on NPUs and passed multiple cards (--gpus 0,1,2,3), but the multi-card setup does not seem to take effect: everything appears to be loaded onto NPU 0, which then runs out of memory with the error shown above. Could you take a look at what is causing this?
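
A quick way to check whether the load is actually spread across cards is to print per-device memory usage. This is a hedged diagnostic sketch; it assumes torch_npu mirrors the torch.cuda memory-introspection API (torch.npu.device_count(), torch.npu.memory_allocated(), torch.npu.memory_reserved()) on your installed version:

# Print how much memory each visible NPU currently holds.
# Assumption: torch_npu exposes the same memory-introspection calls as torch.cuda.
import torch
import torch_npu  # registers the "npu" device with PyTorch

for i in range(torch.npu.device_count()):
    allocated = torch.npu.memory_allocated(i) / 2**30
    reserved = torch.npu.memory_reserved(i) / 2**30
    print(f"NPU {i}: {allocated:.2f} GiB allocated, {reserved:.2f} GiB reserved")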

The problem is that torch_npu does not support multi-card communication within a single process. This will be fixed in the official torch_npu & CANN release at the end of April. FYI huggingface/accelerate#2368.
Also, device_map="auto" will be needed once the above fix lands.
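
For reference, a minimal sketch of loading the model sharded across cards with device_map="auto" (model path and dtype are taken from the command above; whether device_map="auto" dispatches layers to NPUs correctly depends on the fixed torch_npu/CANN and accelerate versions, so treat this as an assumption rather than a confirmed recipe):

# Shard Qwen1.5-32B-Chat across the visible NPUs via accelerate's device_map="auto".
# Assumes a torch_npu/CANN build with single-process multi-card support (see above).
import torch
import torch_npu  # registers the "npu" device with PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/home/models/Qwen1.5-32B-Chat"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",  # let accelerate place layers across NPU 0-3
)

prompt = "Hello, who are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))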