[Model] DeepSeek-V3 Enhancements
simon-mo opened this issue · comments
This issue tracks follow-up enhancements after initial support for the DeepSeek-V3 model. Please feel free to chime in and contribute!
- Follow up #11523: enhance testing with shapes of production models and run it regularly on H100.
- Solve via CUTLASS blockwise quantization kernels.
- Follow up #11502:
- Test and enable torch.compile
- Refactor MoEMethodBase to unify and clean up the extra arguments of scoring_func and e_correction_bias
- Kernel tuning for 8xH200, MI300x, H100 (TP16 and TP8PP2 case)
- Use https://github.com/vllm-project/vllm/blob/main/benchmarks/kernels/benchmark_moe.py, but adapt it for the w8a8 fused moe kernel.
- CUDA Graph support
- MLA #10927 @simon-mo
- Support nextn prediction heads (EAGLE style prediction heads)
- Support expert parallelism for MoE.
- Support data parallelism for MLA.
If I want to deploy the ~600B DeepSeek model using vLLM and RTX 4090s, are there any restrictions? How many RTX 4090s do I need at least?
Is inference with A100s supported? How about quantization?
Deepseek v3 doesn't appear to support pipeline parallelism. I get this error when attempting to deploy to 2 8x H100 nodes:
NotImplementedError: Pipeline parallelism is only supported for the following architectures: ['AquilaForCausalLM', 'AquilaModel', 'DeepseekV2ForCausalLM', 'GPT2LMHeadModel', 'InternLM2ForCausalLM', 'InternLMForCausalLM', 'InternVLChatModel', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'NemotronForCausalLM', 'Phi3ForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration'].
I'm using --tensor-parallel-size 8 --pipeline-parallel-size 2
@july8023 It should work on 4090s; generally the model takes about 600 GB of memory, then you want about 100-300 GB for KV cache, so feel free to plan around that.
@fsaudm A100s are not supported because this model requires FP8 tensor cores.
@mphilippnv which version of vLLM are you using? You might need to update to v0.6.6 or higher.
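For the 4090 sizing above, here is a quick back-of-the-envelope count as a sketch; the memory figures are the rough numbers from this comment, not measurements.

```python
# Rough GPU-count estimate for DeepSeek-V3 on RTX 4090s, using the
# approximate figures above (~600 GB of weights, 100-300 GB of KV cache).
import math

weights_gb = 600
gpu_mem_gb = 24  # RTX 4090

for kv_cache_gb in (100, 300):
    total_gb = weights_gb + kv_cache_gb
    gpus = math.ceil(total_gb / gpu_mem_gb)
    print(f"{total_gb} GB total -> at least {gpus} x RTX 4090")

# 700 GB -> at least 30 GPUs; 900 GB -> at least 38 GPUs.
# In practice you also need headroom for activations and a GPU count that
# matches your tensor/pipeline-parallel layout, so round up from there.
```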
@simon-mo right, A100s don't support fp8. Would the arg --dtype bfloat16 suffice? If not, I found the bf16 version in Huggingface, any insights on whether that would work?
The model currently does not support --dtype bfloat16 because it is natively trained in fp8. Can you point me to the bf16 version?
@simon-mo on HF: https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main. On the official repo they provide a script to cast fp8 to bf16, but of course you can't do it on A100s... my guess is a good soul did it and uploaded it to HF. In the repo, see 6.
vLLM does support this bf16 model on A100. It looks like the config.json properly removed quantization_config, so it should already work.
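If it helps anyone checking their own copy, here is a minimal sketch of that config check (assuming transformers is installed and the bf16 re-upload ships the custom configuration file like the official repo does):

```python
# Minimal check that the bf16 re-upload has no quantization_config in its
# config.json, which is what lets vLLM load it on pre-FP8 GPUs like the A100.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "opensourcerelease/DeepSeek-V3-bf16",
    trust_remote_code=True,
)
print(getattr(cfg, "quantization_config", None))  # expect None for the bf16 repo
print(cfg.torch_dtype)                            # expect bfloat16
```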
Using v0.6.6
EDIT: Apologies, I was using 0.6.2. Redeploying helm chart with 0.6.6.post1. Will see how it goes.
Does anyone know of a working example of serving DeepSeek-V3 on A100s with vLLM? I'll try later, but any hints or help are very much appreciated.
Hi everyone,
I’m encountering the following error when trying to run the image vllm/vllm-openai:v0.6.6.post1 on a node equipped with 8x H100 SXM GPUs:
ValueError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250102-072212.pkl): functional_call got multiple values for keys ['mlp.experts.e_score_correction_bias', 'mlp.gate.e_score_correction_bias'], which are tied. Consider using tie_weights=False
2025-01-02T15:22:12.753719474Z
Here’s the command I used:
--model deepseek-ai/DeepSeek-V3-Base \
--tensor-parallel-size 8 \
--disable_log_requests \
--uvicorn_log_level error \
--max-model-len 16384 \
--cpu-offload-gb 400 \
--max_num_seqs 1 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--enforce-eager
Does anyone have suggestions or solutions for resolving this issue?
Thanks in advance!
I've had this problem, too. Is there a solution?
Was getting this error; it got resolved by removing CPU offloading... hoping for an explanation.
Also, any suggestions to increase token throughput & context length.
We're stuck at 6 tokens/second, max 10k context length despite 1600 GB VRAM.
I am currently running with tensor+pipeline parallelism on 5 nodes (4x A100 80GB each). The VMs are without InfiniBand.
Would having InfiniBand (i.e. higher inter-node bandwidth & lower latency) be the main solution to increase token throughput? And for context length > 40k, how much more VRAM would be required?
Hi @ishaandatta could you share which model version you are using? I'm getting errors complaining fp8e4nv data type is not supported on CUDA arch < 89
when loading the model on A100 GPUs. Or maybe you are on the bf16 version? https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main. Thanks
We also ran into very slow token processing speed, around 3 tokens/s, even though we use H100s and IB. Any suggestions?
I found tp16 to be about 2X faster than pp=2 tp=8 w/ 2 x H100 nodes. Here's my testing: https://llm-tracker.info/DeepSeek-V3-Testing
Here's vLLM vs SGLang at concurrency=64 atm:
Note, I found that vLLM has some stop token errors for output (that SGLang doesn't have) w/ some of my testing.
Same issue. I used 16 H100 GPUs, set TP=16, deployed using Ray in k8s, and enabled the IB network. I made a simple curl request with 10 input tokens and got 242 output tokens; the curl test took 44 seconds. Can anyone help me figure out why?
Are the perf issues related to the MoE optimization? It is not included in the current version, is it?
@shaowei-su I'm using the bf16 version you linked.
@lhl thank you for sharing this! I'm currently using tp=4 pp=6 as we're aiming for context lengths > 64k.
Just to clarify, your benchmarks indicate ~5 output tokens/s on vLLM & around 10 for SGLang?
If so, I am wondering how deepseek-chat is able to achieve their throughput; I measured it at over 60 output tokens/sec.
for bs=1 SGLang outputs around 26 tok/s:
(sglang) ubuntu@ip-10-1-1-135:~$ python3 -m sglang.bench_serving --backend sglang --num-prompts 50 --max-concurrency 1 --port 8000
Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=8000, dataset_name='sharegpt', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=50, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=0.0, request_rate=inf, max_concurrency=1, seed=1, multi=False, request_rate_range='2,34,2', output_file=None, disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)
#Input tokens: 10354
#Output tokens: 11509
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [07:20<00:00, 8.82s/it]
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max reqeuest concurrency: 1
Successful requests: 50
Benchmark duration (s): 440.98
Total input tokens: 10354
Total generated tokens: 11509
Total generated tokens (retokenized): 11467
Request throughput (req/s): 0.11
Input token throughput (tok/s): 23.48
Output token throughput (tok/s): 26.10
Total token throughput (tok/s): 49.58
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 8819.11
Median E2E Latency (ms): 4817.32
---------------Time to First Token----------------
Mean TTFT (ms): 318.37
Median TTFT (ms): 259.02
P99 TTFT (ms): 1658.59
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 36.41
Median TPOT (ms): 36.97
P99 TPOT (ms): 37.60
---------------Inter-token Latency----------------
Mean ITL (ms): 37.18
Median ITL (ms): 37.06
P99 ITL (ms): 38.91
==================================================
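As a sanity check on the numbers above: at concurrency 1 the steady-state decode rate is roughly the inverse of the per-output-token latency. A quick sketch using the reported figures:

```python
# Decode rate implied by the mean TPOT reported above (concurrency = 1).
mean_tpot_ms = 36.41
print(1000 / mean_tpot_ms)  # ~27.5 tok/s while decoding; the reported 26.1
# tok/s output throughput is slightly lower because TTFT/prefill time is
# included in the end-to-end duration.
```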
You should read the infrastructure section of the DeepSeek Technical Report; they deploy in 320-GPU blocks with specialized/separated functions.
That being said, there are certainly optimizations that can be made for "regular" inference. On vLLM, when doing throughput optimization, with some tuning I can generate >7000 tok/s on a single H100 node for a Llama 3 70B class model at c=512. DSv3 has about half the activations, and at c=512 SGLang currently tops out at about 1100 tok/s on 2x H100 nodes (vLLM is about half of that). You could imagine that there might be a 5-10X throughput optimization available, based naively on activations per forward pass. This is before spec decode like EAGLE or Medusa is factored in.
@simon-mo Is there any way or plan to improve the speed of vllm on deepseek v3? Thanks a lot
We also see 3 tokens/s on 16x H20 with TP=8, PP=2.
When I tested TP=16 on GH200 nodes (FP8 version), I was getting ~7.1 t/s (single batch). Ironically, when I used TP=8 (max_model_len=2048 so it all fit), I was getting slightly faster, which seemed strange.
One of the issues that might be slowing vLLM down is that one of the MoE-specific CUDA kernels is hard-coded for DSv3 to force the use of global memory, which is significantly slower than shared memory. This is due to the limited amount of shared memory available (dependent on the GPU model... for example, the H100 has 227 KB of shared memory per block).
https://github.com/vllm-project/vllm/blob/main/csrc/moe/moe_align_sum_kernels.cu#L232
I don't know how much effect this has for this specific kernel, but it likely has some consequence. Techniques like distributed shared memory (H100+ specific) might be usable, or only keeping the active experts in there... but unfortunately I don't know much about CUDA programming. I spent 2 days trying to implement the "active-expert only" approach, but it only served to slow things down to 4.5 t/s...
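To put a rough number on why the counters spill to global memory for DSv3, here is a back-of-the-envelope sketch. It assumes the per-block counter table is roughly (num_experts + 1) x num_experts int32 entries, which may not match the kernel's exact layout.

```python
# Back-of-the-envelope: can the MoE token-count table fit in H100 shared memory?
# Assumes a counter table of roughly (num_experts + 1) x num_experts int32
# entries, which is only an approximation of the real kernel's layout.
num_experts = 256                    # routed experts in DeepSeek-V3
table_bytes = (num_experts + 1) * num_experts * 4
h100_smem_bytes = 227 * 1024         # ~227 KB usable shared memory per block

print(f"{table_bytes / 1024:.0f} KiB needed vs {h100_smem_bytes / 1024:.0f} KiB available")
print("fits in shared memory:", table_bytes <= h100_smem_bytes)  # False
```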
Hello. When deploying with vLLM now, which parser should be used to support the tool call feature?
vLLM does support this bf16 model on A100. It looks like the config.json properly removed quantization_config, so it should already work.
Does vllm==0.6.6.post1 support this feature?
Can anyone explain why we can only get about 7 tok/s across 2 HGX nodes in any configuration, over verified 3.2 Tbps IB?
I get 10.5 tok/s (1 sequence) on 8*MI300x using sglang, just for reference.
Same, about 13 tokens/sec (1 sequence, long output) on 8*MI300x using sglang. It has an unexpected TTFT lag for me of about 2 seconds though.
Hi @lhl ! Could you please share how you set up multi-node with 8x H100 instances for vLLM? I followed the vLLM distributed inference document to set up two nodes of 8x H100, and I am able to see the following with ray status:
Active:
1 node_62bc8c92be4ee6912d3ac7sfsff5db8acf209daa9e
1 node_ed25244254634eb76cfsfdfc7db4cf366b4a86c9b
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/384.0 CPU
0.0/16.0 GPU
0B/3.88TiB memory
0B/19.46GiB object_store_memory
Demands:
(no resource demands)
Then I try to deploy the vLLM engine with the following command:
MODEL_ID=/mylocal/DeepSeek/DeepSeek-R1,
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
python -m vllm.entrypoints.openai.api_server \
--model $MODEL_ID \
--port 8002 \
--tensor-parallel-size 16 \
--max-model-len 20000 \
--trust-remote-code \
--distributed-executor-backend ray
But I kept seeing
Started a local Ray instance. View the dashboard at 127.0.0.1:8266
WARNING 02-02 19:54:20 ray_utils.py:315] The number of required GPUs exceeds the total number of available GPUs in the placement group.
INFO 02-02 19:54:30 ray_utils.py:212] Waiting for creating a placement group of specs for 10 seconds. specs=[{'GPU': 1.0, 'node:172.31.42.4': 0.001}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}, {'GPU': 1.0}]. Check `ray status` to see if you have enough resources.
Any instruction will be much appreciated! CC: @pseudotensor @teknium1
If you are having trouble with the docs https://docs.vllm.ai/en/latest/serving/distributed_serving.html#running-vllm-on-multiple-nodes and the referenced helper script, I'm not sure I can help - I spent a fair amount of work adapting ray to play nice with my slurm setup, so it's not very applicable for raw nodes. I'd maybe search or start a "discussion" thread and see if you can get an answer.
Barring that, I will have to say that sglang's multi-node launching is dead simple, so you could give that a spin if you can't get vLLM working: https://docs.sglang.ai/backend/server_arguments.html#common-launch-commands
Thanks @simon-mo, does the EP support include the optimization for the shared expert(s) as described in the DeepSeek-V3 paper?
@simon-mo on HF: https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main. On the official repo they provide a script to cast fp8 to bf16, but of course you can't do it on A100s... my guess is a good soul did it and uploaded it to HF. In the repo, see 6.
I wonder why this conversion cannot be performed on A100s; the script (https://github.com/deepseek-ai/DeepSeek-V3) doesn't seem to need to load all of the DeepSeek-V3 model on the GPU. @fsaudm Can you show me the reason? Thanks
The script processes the weight files one by one: it reads the FP8 weights and converts them into BF16 format through computation on the GPU. Because GPUs such as the A100/A800 cannot process the FP8 format, the script cannot be used on them. @ehuaa
May I ask if anyone has used 4 nodes of 8x H100 for inference services, and whether vLLM can run successfully?
Yes. It's incredibly slow though. Like 6 token/s.
@Neo9061 Hello, I think the CUDA_VISIBLE_DEVICES environment variable is misconfigured. CUDA_VISIBLE_DEVICES should only list the GPUs of the current node. For example, with two 8x H100 machines, CUDA_VISIBLE_DEVICES on both machines should be 0,1,2,3,4,5,6,7. This is how I set it. You can try it.
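If it helps to verify that fix, here is a small sketch to confirm the Ray cluster actually registers all 16 GPUs before launching vLLM; it assumes ray start was already run on both nodes with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 as suggested above.

```python
# Sanity check: connect to the running Ray cluster and confirm 16 GPUs are
# registered; if this is below 16, the placement-group warning above
# ("required GPUs exceeds the total number of available GPUs") will appear.
import ray

ray.init(address="auto")
gpus = ray.cluster_resources().get("GPU", 0)
print(f"GPUs visible to Ray: {gpus}")  # expect 16.0 for two 8-GPU nodes
```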
May I ask if anyone has used 4 nodes of 8x H100 for inference services, and whether vLLM can run successfully?
Yes. It's incredibly slow though. Like 6 token/s.
May I ask if this is using FP8 precision or BF16 precision? It seems that four H800 machines with FP8 precision cannot run it.
I think it was bf16. It was whatever the standard settings are.
Thank you, yes, I have also verified that BF16 is feasible, but FP8 cannot run smoothly, possibly because vLLM's parallel strategy does not yet support it.
I've had this problem, too. Is there a solution?
Was getting this error; it got resolved by removing CPU offloading... hoping for an explanation.
Also, any suggestions to increase token throughput & context length. We're stuck at 6 tokens/second, max 10k context length despite 1600 GB VRAM. I am currently running with tensor+pipeline parallelism on 5 nodes (4x A100 80GB each). The VMs are without InfiniBand.
Would having InfiniBand (i.e. higher inter-node bandwidth & lower latency) be the main solution to increase token throughput? And for context length > 40k, how much more VRAM would be required?
Hi @ishaandatta could you share which model version you are using? I'm getting errors complaining fp8e4nv data type is not supported on CUDA arch < 89 when loading the model on A100 GPUs. Or maybe you are on the bf16 version? https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main. Thanks
Hi, I've had this problem too. Is there a solution?
After running vLLM for a period of time, I get an error message indicating insufficient CUDA memory. Have you encountered this before? How should we handle it?
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.28 GiB. GPU 0 has a total capacity of 95.22 GiB of which 3.11 GiB is free. Including non-PyTorch memory, this process has 92.10 GiB memory in use. Of the allocated memory 73.21 GiB is allocated by PyTorch, with 82.00 MiB allocated in private pools (e.g., CUDA Graphs), and 2.87 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/root/anaconda3/envs/deepseek/lib/python3.12/site-packages/vllm/engine/async_llm_engine.py", line 70, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
CRITICAL 02-16 02:55:40 launcher.py:74] AsyncLLMEngine has failed, terminating server process
Try --enforce-eager; it may be helpful.
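For reference, a minimal sketch combining the mitigations mentioned in this thread and in the error message itself (eager mode, a bit more memory headroom, and the allocator hint); the model path and parallel sizes here are placeholders, not a verified configuration.

```python
# Sketch: set the allocator hint from the OOM message before vLLM/torch start,
# leave a little more GPU memory headroom, and run in eager mode.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from vllm import LLM  # import after the env var is set

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # placeholder; use your local path/checkpoint
    tensor_parallel_size=8,            # placeholder parallel layout
    max_model_len=16384,
    trust_remote_code=True,
    gpu_memory_utilization=0.90,       # slightly lower than 0.95 for headroom
    enforce_eager=True,
)
```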