alibaba / rtp-llm

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

Decoder not generating output properly, causing an infinite loop draining results from the queue while running `example/test.py`

LinZong opened this issue · comments

Environment

Hardware:

❯ nvidia-smi
Thu Jan 25 01:34:45 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.04              Driver Version: 536.23       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:2B:00.0 Off |                  Off |
|  0%   30C    P8              16W / 500W |    971MiB / 24564MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Operating System:

❯ uname -a
Linux Nemesiss-MSI 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Model: Qwen-7B-Chat (downloaded from ModelScope)

pip list:

I just created a fresh conda env and executed pip3 install -r ./deps/requirements_torch_gpu.txt

Package            Version
------------------ ------------
annotated-types    0.6.0
anyio              4.2.0
certifi            2023.11.17
charset-normalizer 3.3.2
click              8.1.7
contourpy          1.2.0
cpm-kernels        1.0.11
cycler             0.12.1
dacite             1.8.1
einops             0.7.0
exceptiongroup     1.2.0
fastapi            0.108.0
filelock           3.13.1
fonttools          4.47.2
fsspec             2023.12.2
h11                0.14.0
huggingface-hub    0.20.3
idna               3.6
importlib-metadata 7.0.1
Jinja2             3.1.3
kiwisolver         1.4.5
lru-dict           1.3.0
maga_transformer   0.0.1
MarkupSafe         2.1.4
matplotlib         3.8.2
mpmath             1.3.0
networkx           3.2.1
numpy              1.24.1
packaging          23.2
pillow             10.2.0
pip                23.3.1
prettytable        3.9.0
protobuf           3.20.0
psutil             5.9.8
py-spy             0.3.14
pyarrow            15.0.0
pydantic           2.5.3
pydantic_core      2.14.6
pynvml             11.5.0
pyodps             0.11.5.post0
pyparsing          3.1.1
pystack-debugger   0.10.0
python-dateutil    2.8.2
PyYAML             6.0.1
regex              2023.12.25
requests           2.31.0
safetensors        0.4.2
sentencepiece      0.1.99
setuptools         68.2.2
six                1.16.0
sniffio            1.3.0
starlette          0.32.0.post1
sympy              1.12
thrift             0.16.0
tiktoken           0.4.0
tokenizers         0.13.3
torch              2.1.0+cu118
torchvision        0.16.0
tqdm               4.66.1
transformers       4.33.1
triton             2.1.0
typing_extensions  4.9.0
urllib3            1.26.18
uvicorn            0.21.1
wcwidth            0.2.13
wheel              0.41.2
zipp               3.17.0

I am running example/test.py inside a Docker container created following the instructions in docs/Build.md.

from maga_transformer.pipeline import Pipeline
from maga_transformer.model_factory import ModelFactory

if __name__ == '__main__':
    # Load the local Qwen-7B-Chat checkpoint and build an inference pipeline.
    model = ModelFactory.from_huggingface("/path/to/models/Qwen-7B-Chat")
    pipeline = Pipeline(model, model.tokenizer)
    # Stream generation results for a single chat-formatted prompt.
    for res in pipeline(["<|im_start|>user\nhello, what's your name<|im_end|>\n<|im_start|>assistant\n"], max_new_tokens = 100):
        print(res.batch_response)
    pipeline.stop()
    
  # $ ls -1 /path/to/models/Qwen-7B-Chat
  # LICENSE.md
  # NOTICE.md
  # README.md
  # assets
  # config.json
  # configuration.json
  # configuration_qwen.py
  # generation_config.json
  # modeling_qwen.py
  # pytorch_model-00001-of-00008.bin
  # pytorch_model-00002-of-00008.bin
  # pytorch_model-00003-of-00008.bin
  # pytorch_model-00004-of-00008.bin
  # pytorch_model-00005-of-00008.bin
  # pytorch_model-00006-of-00008.bin
  # pytorch_model-00007-of-00008.bin
  # pytorch_model-00008-of-00008.bin
  # pytorch_model.bin.index.json
  # quickstart.md
  # qwen
  # qwen.tiktoken
  # qwen_generation_utils.py
  # tokenization_qwen.py

Current Behavior

  1. The infinite loop in maga_transformer/pipeline/pipeline.py#L241 eats up one CPU core; it looks like nothing can ever be taken from the queue.
  2. The queue producer, which (maybe? I'm not sure) lives in maga_transformer/ops/gpt_ops/gpt_context_decoder.py#L89, seems to be stuck and never returns any output.

My stack dump:

(stack dump screenshot)

Expected behavior

The model generates a response successfully, and that response is equivalent to what the model's official sample code produces for the same prompt.

Hi Lin,

Thanks for reaching out. We have found a potential problem that might be causing your issue and we are working on a fix. However, we would still like some more information from your case if possible. It would be appreciated if you could use cuda-gdb to provide a more detailed stack trace when the execution hangs.

Here's a brief instruction:

  1. Use ps auxww | grep example to find the PID of the main process.
  2. Run sudo /usr/local/cuda/bin/cuda-gdb attach $PID to attach the CUDA debugger.
  3. In gdb, run thread apply all bt to print all stack traces, and paste that output here.
  4. Note that the output might be too long to fit on screen; you can refer to https://stackoverflow.com/questions/5941158/gdb-print-to-file-instead-of-stdout to redirect it to a file, and you may also want to disable paging by running set pagination off. A consolidated session sketch follows below.
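
Putting the steps above together, a session might look roughly like this (the cuda-gdb path and the log file name are just examples; adjust them to your setup):

# 1. find the PID of the hanging example/test.py process
ps auxww | grep example

# 2. attach the CUDA debugger to that PID
sudo /usr/local/cuda/bin/cuda-gdb attach $PID

# 3. inside cuda-gdb: disable paging, log to a file, dump every thread's backtrace
(cuda-gdb) set pagination off
(cuda-gdb) set logging file gdb_backtrace.txt
(cuda-gdb) set logging on
(cuda-gdb) thread apply all bt
(cuda-gdb) set logging off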

You can build a package based on the new commit: 3d73ccb
(since v0.1.2 has already been released, this commit is not included in the whl package we provide).
This commit may fix your problem.
If it is still not resolved, please follow the debugging steps above and provide more detailed information.
Thanks

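For reference, grabbing that commit locally before rebuilding might look like the sketch below (the exact build command is whatever docs/Build.md prescribes for your environment; the wheel path is a placeholder):

# check out the fix commit referenced above (short hash 3d73ccb)
git fetch origin
git checkout 3d73ccb

# rebuild the wheel by following docs/Build.md, then reinstall it, e.g.:
pip3 install --force-reinstall /path/to/built/maga_transformer-*.whl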

Thanks for the quick reply! I am building a whl from commit 3d73ccb and will see what happens.

Runs like butter, thanks!

@LinZong Hi, the RTP-LLM project is currently running a community "catch a bug, get a coffee" campaign. To thank you for your contribution to the project, we would like to send you a coffee as a small token of appreciation~ Please scan the QR code to join the group and add the operations staff member who posted the group announcement as a friend to claim it 💗
(QR code image)