vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Home Page: https://docs.vllm.ai


[Performance]: Qwen 7B Chat model: at 128-way concurrency, CPU utilization is 100% while GPU SM utilization is only about 60%-75%. Is the CPU the bottleneck?

markluofd opened this issue · comments

Proposal to improve performance

No response

Report of performance regression

No response

Misc discussion on performance

I am using vLLM to deploy a Qwen 7B Chat model service. In a very high concurrency scenario, such as 128 concurrent requests, I found that CPU utilization reached 100% while GPU utilization stayed below 60%.

My question: since much of vLLM's scheduling and computation logic is implemented with Python coroutines, it can only use the compute of a single CPU core. In a scenario like this with 128 concurrent requests, does the CPU become the bottleneck, preventing the GPU's CUDA kernels from reaching higher utilization?
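One way to test this hypothesis (a sketch, not from the original report; it assumes psutil and pynvml/nvidia-ml-py are installed) is to sample per-core CPU usage and GPU utilization side by side while the load is running. A single core pinned near 100% while GPU SM utilization stays well below 100% would be consistent with a single-threaded scheduling bottleneck:

import psutil
import pynvml

# Minimal monitoring sketch: sample per-core CPU usage and GPU utilization once per second.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0 (the A100 in this setup)

for _ in range(30):
    per_core = psutil.cpu_percent(interval=1.0, percpu=True)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    # One core near 100% with util.gpu well under 100% suggests a CPU-side bottleneck.
    print(f"busiest core: {max(per_core):5.1f}%  GPU SM: {util.gpu:3d}%")

pynvml.nvmlShutdown()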

Model download address: https://huggingface.co/Qwen/Qwen-7B-Chat/tree/main

  1. For the server scenario (a load-generator sketch follows this list)
     [screenshots: CPU and GPU utilization during serving]
  2. For the offline batch inference scenario
     [screenshots: CPU and GPU utilization during batch inference]
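For the server scenario, a minimal load generator along these lines reproduces 128-way concurrency (a sketch; the URL, port, and served model name are assumptions, and the payload follows the OpenAI-compatible chat completions API served by vllm.entrypoints.openai.api_server):

import asyncio
import aiohttp

URL = "http://localhost:8000/v1/chat/completions"  # assumed server address
CONCURRENCY = 128

async def one_request(session, prompt):
    payload = {
        "model": "Qwen-7B-Chat",  # hypothetical served model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 500,
    }
    async with session.post(URL, json=payload) as resp:
        return await resp.json()

async def main():
    async with aiohttp.ClientSession() as session:
        # Fire all 128 requests at once to mirror the reported concurrency.
        tasks = [one_request(session, "Tell me about large language models.")
                 for _ in range(CONCURRENCY)]
        results = await asyncio.gather(*tasks)
        print(f"completed {len(results)} requests")

asyncio.run(main())

The script used for the offline batch inference scenario: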
import random
import json
from vllm import LLM, SamplingParams

conc = 128
jsonl_path = "xxx.jsonl"

# Read prompts from the JSONL file (one chat record per line)
all_prompts = []
with open(jsonl_path, "r") as f:
    for line in f:
        line_obj = json.loads(line)
        print("line_obj as: ", line_obj)
        try:
            prompt = line_obj[-1]["content"]
        except KeyError:
            # Some records use a capitalized key.
            prompt = line_obj[-1]["Content"]

        all_prompts.append(prompt)

# Sample prompts.
if len(all_prompts) > conc:
    prompts = all_prompts[:conc]
else:
    prompts = random.choices(all_prompts, k=conc)

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=500)

# Create an LLM.
# llm = LLM(model="facebook/opt-125m")
# Qwen 7B Chat (local snapshot)
llm = LLM(model="/models/models--Qwen--Qwen-7B-Chat-new", trust_remote_code=True)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Your current environment (if you think it is necessary)

Collecting environment information...
PyTorch version: 2.2.2+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.27

Python version: 3.9.16 (main, May 15 2023, 23:46:34)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.83.1.el7.x86_64-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 535.154.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.6.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              14
On-line CPU(s) list: 0-13
Thread(s) per core:  2
Core(s) per socket:  7
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               106
Model name:          Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
Stepping:            6
CPU MHz:             2593.904
BogoMIPS:            5187.80
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           48K
L1i cache:           32K
L2 cache:            1280K
L3 cache:            49152K
NUMA node0 CPU(s):   0-13
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq spec_ctrl

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu11==2.19.3
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] torch==2.2.2+cu118
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu11==2.18.1.0.4.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu11          2.19.3                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.19.3                   pypi_0    pypi
[conda] torch                     2.2.2+cu118              pypi_0    pypi
[conda] triton                    2.2.0                    pypi_0    pypi
[conda] vllm-nccl-cu11            2.18.1.0.4.0             pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	NIC0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	SYS	0-13	0		N/A
NIC0	SYS	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

When you start the vLLM API server, its logs report GPU and CPU KV cache utilization, for example:
Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
The KV cache occupies GPU memory first and spills to CPU memory afterwards; the FP8 E4M3 KV cache can be used to reduce KV cache memory usage.
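For the offline script above, enabling the FP8 KV cache would look roughly like this (an assumption: kv_cache_dtype="fp8" is accepted in this vLLM build; older releases used "fp8_e5m2", so check the docs for your version):

from vllm import LLM

# Same model as above, with the KV cache stored in FP8 to cut its memory footprint.
llm = LLM(
    model="/models/models--Qwen--Qwen-7B-Chat-new",
    trust_remote_code=True,
    kv_cache_dtype="fp8",  # assumed to be supported in this build
)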