[Model] DeepSeek-V3 Enhancements
simon-mo opened this issue · comments
This issue tracks follow-up enhancements after initial support for the DeepSeek-V3 model. Please feel free to chime in and contribute!
- Follow up #11523: enhance testing with shapes of production models and run it regularly on H100.
- Solve via CUTLASS blockwise quantization kernels.
- Follow up #11502:
- Test and enable torch.compile
- Refactor MoEMethodBase to unify and clean up the extra arguments of scoring_func and e_correction_bias
- Kernel tuning for 8xH200, MI300x, H100 (TP16 and TP8PP2 case)
- Use https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py, but adapt it for the w8a8 fused moe kernel.
- CUDA Graph support
- MLA #10927 @simon-mo
- Support nextn prediction heads (EAGLE style prediction heads)
- Support expert parallelism for MoE.
- Support data parallelism for MLA.
If I want to deploy the ~600B DeepSeek model using vLLM and RTX 4090s, are there any restrictions? How many RTX 4090s do I need at least?
Is inference with A100s supported? How about quantization?
Deepseek v3 doesn't appear to support pipeline parallelism. I get this error when attempting to deploy to 2 8x H100 nodes:
NotImplementedError: Pipeline parallelism is only supported for the following architectures: ['AquilaForCausalLM', 'AquilaModel', 'DeepseekV2ForCausalLM', 'GPT2LMHeadModel', 'InternLM2ForCausalLM', 'InternLMForCausalLM', 'InternVLChatModel', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'NemotronForCausalLM', 'Phi3ForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration'].
I'm using --tensor-parallel-size 8 --pipeline-parallel-size 2
@july8023 It should work on 4090s; generally the model takes about 600 GB of memory, then you want about 100-300 GB for KV cache, so feel free to plan around that.
@fsaudm A100s are not supported because this model requires FP8 tensor cores.
@mphilippnv which version of vLLM are you using? You might need to update to v0.6.6 or higher.
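For the 4090 sizing above, here is a quick back-of-the-envelope count as a sketch; the memory figures are the rough numbers from this comment, not measurements.

```python
# Rough GPU-count estimate for DeepSeek-V3 on RTX 4090s, using the
# approximate figures above (~600 GB of weights, 100-300 GB of KV cache).
import math

weights_gb = 600
gpu_mem_gb = 24  # RTX 4090

for kv_cache_gb in (100, 300):
    total_gb = weights_gb + kv_cache_gb
    gpus = math.ceil(total_gb / gpu_mem_gb)
    print(f"{total_gb} GB total -> at least {gpus} x RTX 4090")

# 700 GB -> at least 30 GPUs; 900 GB -> at least 38 GPUs.
# In practice you also need headroom for activations and a GPU count that
# matches your tensor/pipeline-parallel layout, so round up from there.
```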
@simon-mo right, A100s don't support fp8. Would the arg --dtype bfloat16 suffice? If not, I found the bf16 version in Huggingface, any insights on whether that would work?
The model currently does not support --dtype bfloat16 because it is natively trained in fp8. Can you point me to the bf16 version?
@simon-mo on HF: https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main. On the official repo they provide a script to cast fp8 to bf16, but of course you can't do it on A100s... my guess is a good soul did it and uploaded it to HF. In the repo, see 6.
vLLM does support this bf16 model on A100. It looks like the config.json properly removed quantization_config, so it should already work.
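If it helps anyone checking their own copy, here is a minimal sketch of that config check (assuming transformers is installed and the bf16 re-upload ships the custom configuration file like the official repo does):

```python
# Minimal check that the bf16 re-upload has no quantization_config in its
# config.json, which is what lets vLLM load it on pre-FP8 GPUs like the A100.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "opensourcerelease/DeepSeek-V3-bf16",
    trust_remote_code=True,
)
print(getattr(cfg, "quantization_config", None))  # expect None for the bf16 repo
print(cfg.torch_dtype)                            # expect bfloat16
```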
Using v0.6.6
EDIT: Apologies, I was using 0.6.2. Redeploying helm chart with 0.6.6.post1. Will see how it goes.
Does anyone know of a working example of serving DeepSeek-V3 on A100s with vLLM? I'll try later, but any hints or help are very much appreciated.
Hi everyone,
I’m encountering the following error when trying to run the image vllm/vllm-openai:v0.6.6.post1 on a node equipped with 8x H100 SXM GPUs:
ValueError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250102-072212.pkl): functional_call got multiple values for keys ['mlp.experts.e_score_correction_bias', 'mlp.gate.e_score_correction_bias'], which are tied. Consider using tie_weights=False
2025-01-02T15:22:12.753719474Z
Here’s the command I used:
--model deepseek-ai/DeepSeek-V3-Base \
--tensor-parallel-size 8 \
--disable_log_requests \
--uvicorn_log_level error \
--max-model-len 16384 \
--cpu-offload-gb 400 \
--max_num_seqs 1 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--enforce-eager
Does anyone have suggestions or solutions for resolving this issue?
Thanks in advance!
I've had this problem, too. Is there a solution?
Was getting this error; it got resolved by removing CPU offloading... hoping for an explanation.
Also, any suggestions to increase token throughput & context length.
We're stuck at 6 tokens/second, max 10k context length despite 1600 GB VRAM.
I am currently running with tensor+pipeline parallelism on 5 nodes (4x A100 80GB each). The VMs are without InfiniBand.
Would having InfiniBand (i.e. higher inter-node bandwidth & lower latency) be the main solution to increase token throughput? And for context length > 40k, how much more VRAM would be required?
Hi @ishaandatta could you share which model version you are using? I'm getting errors complaining fp8e4nv data type is not supported on CUDA arch < 89
when loading the model on A100 GPUs. Or maybe you are on the bf16 version? https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main. Thanks
We also ran into very slow token processing speed, around 3 tokens/s, even though we use H100s and IB. Any suggestions?
I found tp16 to be about 2X faster than pp=2 tp=8 w/ 2 x H100 nodes. Here's my testing: https://llm-tracker.info/DeepSeek-V3-Testing
Here's vLLM vs SGLang at concurrency=64 atm:
Note, I found that vLLM has some stop token errors for output (that SGLang doesn't have) w/ some of my testing.
Same issue. I used 16 H100 GPUs, set TP=16, deployed using Ray in k8s, and enabled the IB network. I made a simple curl request with 10 input tokens and got 242 output tokens; the curl test took 44 seconds. Can anyone help me figure out why?
Are the perf issues related to the MoE optimization? It is not included in the current version, is it?
@shaowei-su I'm using the bf16 version you linked.
@lhl thank you for sharing this! I'm currently using tp=4 pp=6 as we're aiming for context lengths > 64k.
Just to clarify, your benchmarks indicate ~5 output tokens/s on vLLM & around 10 for SGLang?
If so, I am wondering how deepseek-chat is able to achieve their throughput; I measured it at over 60 output tokens/sec.
for bs=1 SGLang outputs around 26 tok/s:
(sglang) ubuntu@ip-10-1-1-135:~$ python3 -m sglang.bench_serving --backend sglang --num-prompts 50 --max-concurrency 1 --port 8000
Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=8000, dataset_name='sharegpt', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=50, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=0.0, request_rate=inf, max_concurrency=1, seed=1, multi=False, request_rate_range='2,34,2', output_file=None, disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)
#Input tokens: 10354
#Output tokens: 11509
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [07:20<00:00, 8.82s/it]
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max reqeuest concurrency: 1
Successful requests: 50
Benchmark duration (s): 440.98
Total input tokens: 10354
Total generated tokens: 11509
Total generated tokens (retokenized): 11467
Request throughput (req/s): 0.11
Input token throughput (tok/s): 23.48
Output token throughput (tok/s): 26.10
Total token throughput (tok/s): 49.58
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 8819.11
Median E2E Latency (ms): 4817.32
---------------Time to First Token----------------
Mean TTFT (ms): 318.37
Median TTFT (ms): 259.02
P99 TTFT (ms): 1658.59
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 36.41
Median TPOT (ms): 36.97
P99 TPOT (ms): 37.60
---------------Inter-token Latency----------------
Mean ITL (ms): 37.18
Median ITL (ms): 37.06
P99 ITL (ms): 38.91
==================================================
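As a sanity check on the numbers above: at concurrency 1 the steady-state decode rate is roughly the inverse of the per-output-token latency. A quick sketch using the reported figures:

```python
# Decode rate implied by the mean TPOT reported above (concurrency = 1).
mean_tpot_ms = 36.41
print(1000 / mean_tpot_ms)  # ~27.5 tok/s while decoding; the reported 26.1
# tok/s output throughput is slightly lower because TTFT/prefill time is
# included in the end-to-end duration.
```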
You should read the infrastructure section of the DeepSeek Technical Report; they deploy in 320-GPU blocks with specialized/separated functions.
That being said, there are certainly optimizations that can be made for "regular" inference. On vLLM, when doing throughput optimization, with some tuning I can generate >7000 tok/s on a single H100 node for a Llama 3 70B class model at c=512. DSv3 has about half the activations, and at c=512 SGLang currently tops out at about 1100 tok/s on 2x H100 nodes (vLLM is about half of that). You could imagine that there might be a 5-10X throughput optimization available, based naively on activations per forward pass. This is before spec decode like EAGLE or Medusa is factored in.
@simon-mo Is there any way or plan to improve the speed of vllm on deepseek v3? Thanks a lot
We also see 3 tokens/s on 16x H20 with TP=8, PP=2.
When I tested TP=16 on GH200 nodes (FP8 version), I was getting ~7.1 t/s (single batch). Ironically, when I used TP=8 (max_model_len=2048 so it all fit), I was getting slightly faster, which seemed strange.
One of the issues that might be slowing vLLM down is that one of the MoE-specific CUDA kernels is hard-coded for DSv3 to force the use of global memory, which is significantly slower than shared memory. This is due to the limited amount of shared memory available (dependent on the GPU model... for example, the H100 has 227 KB of shared memory per block).
https://github.com/vllm-project/vllm/blob/main/csrc/moe/moe_align_sum_kernels.cu#L232
I don't know how much effect this has for this specific kernel, but it likely has some consequence. Techniques like distributed shared memory (H100+ specific) might be usable, or only keeping the active experts in there... but unfortunately I don't know much about CUDA programming. I spent 2 days trying to implement the "active-expert only" approach, but it only served to slow things down to 4.5 t/s...
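To put a rough number on why the counters spill to global memory for DSv3, here is a back-of-the-envelope sketch. It assumes the per-block counter table is roughly (num_experts + 1) x num_experts int32 entries, which may not match the kernel's exact layout.

```python
# Back-of-the-envelope: can the MoE token-count table fit in H100 shared memory?
# Assumes a counter table of roughly (num_experts + 1) x num_experts int32
# entries, which is only an approximation of the real kernel's layout.
num_experts = 256                    # routed experts in DeepSeek-V3
table_bytes = (num_experts + 1) * num_experts * 4
h100_smem_bytes = 227 * 1024         # ~227 KB usable shared memory per block

print(f"{table_bytes / 1024:.0f} KiB needed vs {h100_smem_bytes / 1024:.0f} KiB available")
print("fits in shared memory:", table_bytes <= h100_smem_bytes)  # False
```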
Hello. When deploying with vLLM now, which parser should be used to support the tool call feature?
vLLM does support this bf16 model on A100. It looks like the config.json properly removed quantization_config, so it should already work.
Does vllm==0.6.6.post1 support this feature?
Can anyone explain why we can only get about 7 tok/s across 2 HGX nodes in any configuration, over verified 3.2 Tbps IB?
I get 10.5 tok/s (1 sequence) on 8*MI300x using sglang, just for reference.
Same, about 13 tokens/sec (1 sequence, long output) on 8*MI300x using sglang. It has an unexpected TTFT lag for me of about 2 seconds though.
Hi @lhl ! Could you please share how you set up multi-node with 8x H100 instances for vLLM? I followed the vLLM distributed inference document to set up two nodes of 8x H100, and I am able to see the following with ray status:
Active:
1 node_62bc8c92be4ee6912d3ac7sfsff5db8acf209daa9e
1 node_ed25244254634eb76cfsfdfc7db4cf366b4a86c9b
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/384.0 CPU
0.0/16.0 GPU
0B/3.88TiB memory
0B/19.46GiB object_store_memory
Demands:
(no resource demands)
Then I try to deploy the vLLM engine with the following command:
MODEL_ID=/mylocal/DeepSeek/DeepSeek-R1,
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
python -m vllm.entrypoints.openai.api_server \
--model $MODEL_ID \
--port 8002 \
--tensor-parallel-size 16 \
--max-model-len 20000 \
--trust-remote-code \
--distributed-executor-backend ray
But I kept seeing
Started a local Ray instance. View the dashboard at 127.0.0.1:8266
WARNING 02-02 19:54:20 ray_utils.py:315] The number of required GPUs exceeds the total number of available GPUs in the placement group.
INFO 02-02 19:54:30 ray_utils.py:212] Waiting for creating a placement group of specs for 10 seconds. specs=[{'GPU': 1.0, 'node:172.31.42.4': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources.
Any instruction will be much appreciated! CC: @pseudotensor @teknium1
If you are having trouble with the docs https://docs.vllm.ai/en/latest/serving/distributed_serving.html#running-vllm-on-multiple-nodes and the referenced helper script, I'm not sure I can help - I spent a fair amount of work adapting ray to play nice with my slurm setup, so it's not very applicable for raw nodes. I'd maybe search or start a "discussion" thread and see if you can get an answer.
Barring that, I will have to say that sglang's multi-node launching is dead simple, so you could give that a spin if you can't get vLLM working: https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands
Thanks @simon-mo, does the EP support include the optimization for the shared expert(s) as described in the DeepSeek-V3 paper?
@simon-mo on HF: https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main. On the official repo they provide a script to cast fp8 to bf16, but of course you can't do it on A100s... my guess is a good soul did it and uploaded it to HF. In the repo, see 6.
I wonder why this conversion cannot be performed on A100s; the script (https://github.com/deepseek-ai/DeepSeek-V3) doesn't seem to need to load all of the DeepSeek-V3 model on the GPU. @fsaudm Can you show me the reason? Thanks
The script processes the weight files one by one: it reads the FP8 weights and converts them into BF16 format through computation on the GPU. Because GPUs such as the A100/A800 cannot process the FP8 format, the script cannot be used on them. @ehuaa
May I ask if anyone has used 4 nodes of 8x H100 for inference services, and whether vLLM can run successfully?
Yes. It's incredibly slow though. Like 6 token/s.
@Neo9061 Hello, I think the CUDA_VISIBLE_DEVICES environment variable is misconfigured. CUDA_VISIBLE_DEVICES should only list the GPUs of the current node. For example, with two 8x H100 machines, CUDA_VISIBLE_DEVICES on both machines should be 0,1,2,3,4,5,6,7. This is how I set it. You can try it.
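If it helps to verify that fix, here is a small sketch to confirm the Ray cluster actually registers all 16 GPUs before launching vLLM; it assumes ray start was already run on both nodes with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 as suggested above.

```python
# Sanity check: connect to the running Ray cluster and confirm 16 GPUs are
# registered; if this is below 16, the placement-group warning above
# ("required GPUs exceeds the total number of available GPUs") will appear.
import ray

ray.init(address="auto")
gpus = ray.cluster_resources().get("GPU", 0)
print(f"GPUs visible to Ray: {gpus}")  # expect 16.0 for two 8-GPU nodes
```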
May I ask if anyone has used 4 nodes of 8x H100 for inference services, and whether vLLM can run successfully?
Yes. It's incredibly slow though. Like 6 token/s.
May I ask if this is using FP8 precision or BF16 precision? It seems that four H800 machines with FP8 precision cannot run it.
I think it was bf16. It was whatever the standard settings are.
Thank you, yes, I have also verified that BF16 is feasible, but FP8 cannot run smoothly, possibly because vLLM's parallel strategy does not yet support it.
I've had this problem, too. Is there a solution?
Was getting this error; it got resolved by removing CPU offloading... hoping for an explanation.
Also, any suggestions to increase token throughput & context length. We're stuck at 6 tokens/second, max 10k context length despite 1600 GB VRAM. I am currently running with tensor+pipeline parallelism on 5 nodes (4x A100 80GB each). The VMs are without InfiniBand.
Would having InfiniBand (i.e. higher inter-node bandwidth & lower latency) be the main solution to increase token throughput? And for context length > 40k, how much more VRAM would be required?
Hi @ishaandatta could you share which model version you are using? I'm getting errors complaining fp8e4nv data type is not supported on CUDA arch < 89 when loading the model on A100 GPUs. Or maybe you are on the bf16 version? https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main. Thanks
Hi, I've had this problem too. Is there a solution?
After running vLLM for a period of time, I get an error message indicating insufficient CUDA memory. Have you encountered this before? How should we handle it?
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.28 GiB. GPU 0 has a total capacity of 95.22 GiB of which 3.11 GiB is free. Including non-PyTorch memory, this process has 92.10 GiB memory in use. Of the allocated memory 73.21 GiB is allocated by PyTorch, with 82.00 MiB allocated in private pools (e.g., CUDA Graphs), and 2.87 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/root/anaconda3/envs/deepseek/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 70, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
CRITICAL 02-16 02:55:40 launcher.py:74] AsyncLLMEngine has failed, terminating server process
Try --enforce-eager; it may be helpful.
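For reference, a minimal sketch combining the mitigations mentioned in this thread and in the error message itself (eager mode, a bit more memory headroom, and the allocator hint); the model path and parallel sizes here are placeholders, not a verified configuration.

```python
# Sketch: set the allocator hint from the OOM message before vLLM/torch start,
# leave a little more GPU memory headroom, and run in eager mode.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from vllm import LLM  # import after the env var is set

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # placeholder; use your local path/checkpoint
    tensor_parallel_size=8,            # placeholder parallel layout
    max_model_len=16384,
    trust_remote_code=True,
    gpu_memory_utilization=0.90,       # slightly lower than 0.95 for headroom
    enforce_eager=True,
)
```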