Decoder not generating output properly, causing infinite loop draining results from queue while running `example/test.py`
LinZong opened this issue
Environment
Hardware:
❯ nvidia-smi
Thu Jan 25 01:34:45 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.04              Driver Version: 536.23       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:2B:00.0 Off |                  Off |
|  0%   30C    P8              16W / 500W |    971MiB / 24564MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Operating System:
❯ uname -a
Linux Nemesiss-MSI 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Model: Qwen-7B-Chat (downloaded from ModelScope)
pip list:
(fresh conda env, populated only by running pip3 install -r ./deps/requirements_torch_gpu.txt)
Package Version
------------------ ------------
annotated-types 0.6.0
anyio 4.2.0
certifi 2023.11.17
charset-normalizer 3.3.2
click 8.1.7
contourpy 1.2.0
cpm-kernels 1.0.11
cycler 0.12.1
dacite 1.8.1
einops 0.7.0
exceptiongroup 1.2.0
fastapi 0.108.0
filelock 3.13.1
fonttools 4.47.2
fsspec 2023.12.2
h11 0.14.0
huggingface-hub 0.20.3
idna 3.6
importlib-metadata 7.0.1
Jinja2 3.1.3
kiwisolver 1.4.5
lru-dict 1.3.0
maga_transformer 0.0.1
MarkupSafe 2.1.4
matplotlib 3.8.2
mpmath 1.3.0
networkx 3.2.1
numpy 1.24.1
packaging 23.2
pillow 10.2.0
pip 23.3.1
prettytable 3.9.0
protobuf 3.20.0
psutil 5.9.8
py-spy 0.3.14
pyarrow 15.0.0
pydantic 2.5.3
pydantic_core 2.14.6
pynvml 11.5.0
pyodps 0.11.5.post0
pyparsing 3.1.1
pystack-debugger 0.10.0
python-dateutil 2.8.2
PyYAML 6.0.1
regex 2023.12.25
requests 2.31.0
safetensors 0.4.2
sentencepiece 0.1.99
setuptools 68.2.2
six 1.16.0
sniffio 1.3.0
starlette 0.32.0.post1
sympy 1.12
thrift 0.16.0
tiktoken 0.4.0
tokenizers 0.13.3
torch 2.1.0+cu118
torchvision 0.16.0
tqdm 4.66.1
transformers 4.33.1
triton 2.1.0
typing_extensions 4.9.0
urllib3 1.26.18
uvicorn 0.21.1
wcwidth 0.2.13
wheel 0.41.2
zipp 3.17.0
Running example/test.py inside a Docker container created following the instructions in docs/Build.md:
from maga_transformer.pipeline import Pipeline
from maga_transformer.model_factory import ModelFactory

if __name__ == '__main__':
    # Load the local Qwen-7B-Chat checkpoint (HuggingFace layout).
    model = ModelFactory.from_huggingface("/path/to/models/Qwen-7B-Chat")
    pipeline = Pipeline(model, model.tokenizer)
    # Stream generation results for one ChatML-formatted prompt.
    for res in pipeline(["<|im_start|>user\nhello, what's your name<|im_end|>\n<|im_start|>assistant\n"], max_new_tokens=100):
        print(res.batch_response)
    pipeline.stop()
# $ ls -1 /path/to/models/Qwen-7B-Chat
# LICENSE.md
# NOTICE.md
# README.md
# assets
# config.json
# configuration.json
# configuration_qwen.py
# generation_config.json
# modeling_qwen.py
# pytorch_model-00001-of-00008.bin
# pytorch_model-00002-of-00008.bin
# pytorch_model-00003-of-00008.bin
# pytorch_model-00004-of-00008.bin
# pytorch_model-00005-of-00008.bin
# pytorch_model-00006-of-00008.bin
# pytorch_model-00007-of-00008.bin
# pytorch_model-00008-of-00008.bin
# pytorch_model.bin.index.json
# quickstart.md
# qwen
# qwen.tiktoken
# qwen_generation_utils.py
# tokenization_qwen.py
Current Behavior
- The infinite loop in maga_transformer/pipeline/pipeline.py#L241 eats up one CPU core; it seems nothing can ever be taken from the queue (a minimal sketch of this consumer pattern follows the list).
- The queue producer, which I believe (though I'm not sure) lives in maga_transformer/ops/gpt_ops/gpt_context_decoder.py#L89, seems to be stuck and never returns any output.
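For illustration, here is a minimal sketch of the consumer pattern I suspect is involved (the names and structure are my assumption, not the actual rtp-llm source). A non-blocking poll loop spins at 100% of one core when the producer never enqueues anything, which matches what I observe; a blocking get with a timeout would surface the hang as an error instead:

import queue

def drain_results(result_queue: queue.Queue, timeout_s: float = 30.0):
    # Hypothetical drain loop. A bare get_nowait() poll spins forever on an
    # empty queue; a blocking get with a timeout turns a stuck producer
    # (here, the decoder) into a visible error.
    while True:
        try:
            item = result_queue.get(timeout=timeout_s)  # block instead of spinning
        except queue.Empty:
            raise TimeoutError("no decoder output within timeout; producer likely stuck")
        if item is None:  # assumed end-of-stream sentinel
            return
        yield item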
My stack dump:
Expected behavior
The model generates a response successfully, and that response is equivalent to the one produced by the model's official sample code with the same prompt.
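For reference, the official sample code I am comparing against is essentially the standard Qwen-7B-Chat usage from its model card, pointed at the same local checkout (a sketch; model.chat is Qwen's custom helper defined in the modeling_qwen.py listed above):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/path/to/models/Qwen-7B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, device_map="auto", trust_remote_code=True
).eval()

# `chat` applies the same ChatML template used in the prompt above.
response, _history = model.chat(tokenizer, "hello, what's your name", history=None)
print(response)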
Hi Lin,
Thanks for reaching out. We have found a potential issue that might cause this behavior, and we are working on a fix. However, we would still like some more information about your case if possible. It would be appreciated if you could use cuda-gdb to provide a more detailed stack trace when execution hangs.
Here are brief instructions (an example session is sketched below):
- Use ps auxww | grep example to find the PID of the main process.
- Run sudo /usr/local/cuda/bin/cuda-gdb attach $PID to attach the CUDA debugger.
- In gdb, run thread apply all bt to print all stack traces, and paste that output here.
- Note that the output may be too long for the screen; see https://stackoverflow.com/questions/5941158/gdb-print-to-file-instead-of-stdout for printing to a file. You may also want to disable paging with set pagination off.
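Putting these steps together, an example session might look like this (the PID and log path are illustrative):

❯ ps auxww | grep example          # note the PID of the main python process
❯ sudo /usr/local/cuda/bin/cuda-gdb attach $PID
(cuda-gdb) set pagination off
(cuda-gdb) set logging file /tmp/backtrace.txt
(cuda-gdb) set logging on
(cuda-gdb) thread apply all bt
(cuda-gdb) set logging off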
You can build a package based on the new commit: 3d73ccb (a rough checkout sequence is sketched below).
(Since v0.1.2 has already been released, this commit is not included in the whl package we provide.)
This commit may fix your problem.
If it does not, please follow the debug steps above and provide more detailed information.
Thanks.
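For reference, fetching and checking out that commit before rebuilding would look roughly like this; the actual build and packaging steps are the ones in docs/Build.md:

❯ git fetch origin
❯ git checkout 3d73ccb
❯ # then rebuild and reinstall the maga_transformer wheel per docs/Build.md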
Runs like butter, thanks!