deepseek-ai / DeepGEMM

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Repository from GitHub: https://github.com/deepseek-ai/DeepGEMM

RuntimeError: Failed: CUDA runtime error csrc/jit/kernel_runtime.hpp:108 '98'

krishung5 opened this issue

Seeing a CUDA runtime error raised from deep_gemm.py when running the vLLM WideEP multinode setup with DeepSeek-R1 on commits after #112.

The same issue was reported here. It occurs during model loading:

2025-08-01T08:01:22.575940Z ERROR core.run_engine_core: EngineCore failed to start.
Traceback (most recent call last):
  File "/opt/vllm/vllm/v1/engine/core.py", line 621, in run_engine_core
    engine_core = DPEngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/v1/engine/core.py", line 881, in __init__
    super().__init__(vllm_config, local_client, handshake_address,
  File "/opt/vllm/vllm/v1/engine/core.py", line 441, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/opt/vllm/vllm/v1/engine/core.py", line 77, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/executor/executor_base.py", line 53, in __init__
    self._init_executor()
  File "/opt/vllm/vllm/executor/uniproc_executor.py", line 49, in _init_executor
    self.collective_rpc("load_model")
  File "/opt/vllm/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/utils/__init__.py", line 2985, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/v1/worker/gpu_worker.py", line 201, in load_model
    self.model_runner.load_model(eep_scale_up=eep_scale_up)
  File "/opt/vllm/vllm/v1/worker/gpu_model_runner.py", line 1876, in load_model
    self.model = model_loader.load_model(
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
    process_weights_after_loading(model, model_config, target_device)
  File "/opt/vllm/vllm/model_executor/model_loader/utils.py", line 126, in process_weights_after_loading
    module.process_weights_after_loading(model_config.dtype)
  File "/opt/vllm/vllm/attention/layer.py", line 310, in process_weights_after_loading
    self.impl.process_weights_after_loading(act_dtype)
  File "/opt/vllm/vllm/v1/attention/backends/mla/common.py", line 994, in process_weights_after_loading
    kv_b_proj_weight = get_and_maybe_dequant_weights(self.kv_b_proj).T
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/v1/attention/backends/mla/common.py", line 983, in get_and_maybe_dequant_weights
    dequant_weights = layer.quant_method.apply(layer,
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/model_executor/layers/quantization/fp8.py", line 451, in apply
    return torch.ops.vllm.apply_w8a8_block_fp8_linear(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 147, in apply_w8a8_block_fp8_linear
    output = torch.ops.vllm.w8a8_block_fp8_matmul_deepgemm(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/model_executor/layers/quantization/deepgemm.py", line 58, in w8a8_block_fp8_matmul_deepgemm
    fp8_gemm_nt((A, As), (B, Bs), C)
  File "/opt/vllm/vllm/utils/deep_gemm.py", line 92, in fp8_gemm_nt
    return _fp8_gemm_nt_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Failed: CUDA runtime error csrc/jit/kernel_runtime.hpp:108 '98'

To repro, build the container using the Dockerfile provided in the comment, and run the vLLM WideEP multinode example:

# node 1
VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 \
    vllm serve deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 1 --enable-expert-parallel --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address <node1 ip address> \
    --data-parallel-rpc-port 13345 --api-server-count=8 --gpu-memory-utilization 0.95 --max-model-len 10240 --enforce-eager

# node 2
VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 \
    vllm serve deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 1 --enable-expert-parallel --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address <node1 ip address> \
    --data-parallel-rpc-port 13345 --data-parallel-start-rank 8 --gpu-memory-utilization 0.95 --max-model-len 10240 --enforce-eager --headless

Error 98 means cudaErrorInvalidDeviceFunction: the requested device function does not exist or is not compiled for the proper device architecture. You may set DG_JIT_DEBUG=1 to check whether the compiler arch flag (e.g., 90a for the Hopper series) is suitable for your device.
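As a sketch of how to apply that suggestion to the repro above (DG_JIT_DEBUG is the only addition; the remaining flags are the node 1 command unchanged):

# re-run the failing node with DeepGEMM JIT debug logging to inspect the nvcc arch flag it uses
DG_JIT_DEBUG=1 VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 \
    vllm serve deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 1 --enable-expert-parallel --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address <node1 ip address> \
    --data-parallel-rpc-port 13345 --api-server-count=8 --gpu-memory-utilization 0.95 --max-model-len 10240 --enforce-eager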

commented

I encountered the same problem on H20-96G.
Driver version: 565.57.01
NVCC version: CUDA 12.9
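A quick way to double-check the reported architecture against the 90a flag mentioned above (H20 is a Hopper / sm_90 part; the compute_cap query field is assumed to be available on this driver generation):

# print the GPU model and compute capability the driver reports, plus the toolkit nvcc version
nvidia-smi --query-gpu=name,compute_cap --format=csv
nvcc --version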

Using commit f85ec64 worked on my side. Closing this issue.
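For reference, a minimal sketch of pinning DeepGEMM to that commit as a workaround (the clone/checkout steps are generic; the actual build/install step should follow the DeepGEMM README for your environment):

# pin DeepGEMM to the known-good commit mentioned above
git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM
git checkout f85ec64
# then rebuild/reinstall DeepGEMM following the repository README and restart the vLLM servers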