RuntimeError: Failed: CUDA runtime error csrc/jit/kernel_runtime.hpp:108 '98'
krishung5 opened this issue
Seeing a CUDA runtime error from deep_gemm.py when running the vLLM WideEP multinode setup with DeepSeek-R1 on commits after #112.
The same issue was reported here. It occurs during model loading:
2025-08-01T08:01:22.575940Z ERROR core.run_engine_core: EngineCore failed to start.
Traceback (most recent call last):
File "/opt/vllm/vllm/v1/engine/core.py", line 621, in run_engine_core
engine_core = DPEngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/vllm/v1/engine/core.py", line 881, in __init__
super().__init__(vllm_config, local_client, handshake_address,
File "/opt/vllm/vllm/v1/engine/core.py", line 441, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/opt/vllm/vllm/v1/engine/core.py", line 77, in __init__
self.model_executor = executor_class(vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/vllm/executor/executor_base.py", line 53, in __init__
self._init_executor()
File "/opt/vllm/vllm/executor/uniproc_executor.py", line 49, in _init_executor
self.collective_rpc("load_model")
File "/opt/vllm/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/vllm/utils/__init__.py", line 2985, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/vllm/v1/worker/gpu_worker.py", line 201, in load_model
self.model_runner.load_model(eep_scale_up=eep_scale_up)
File "/opt/vllm/vllm/v1/worker/gpu_model_runner.py", line 1876, in load_model
self.model = model_loader.load_model(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
process_weights_after_loading(model, model_config, target_device)
File "/opt/vllm/vllm/model_executor/model_loader/utils.py", line 126, in process_weights_after_loading
module.process_weights_after_loading(model_config.dtype)
File "/opt/vllm/vllm/attention/layer.py", line 310, in process_weights_after_loading
self.impl.process_weights_after_loading(act_dtype)
File "/opt/vllm/vllm/v1/attention/backends/mla/common.py", line 994, in process_weights_after_loading
kv_b_proj_weight = get_and_maybe_dequant_weights(self.kv_b_proj).T
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/vllm/v1/attention/backends/mla/common.py", line 983, in get_and_maybe_dequant_weights
dequant_weights = layer.quant_method.apply(layer,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/vllm/model_executor/layers/quantization/fp8.py", line 451, in apply
return torch.ops.vllm.apply_w8a8_block_fp8_linear(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/dynamo/venv/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
return self._op(*args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 147, in apply_w8a8_block_fp8_linear
output = torch.ops.vllm.w8a8_block_fp8_matmul_deepgemm(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/dynamo/venv/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
return self._op(*args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/vllm/model_executor/layers/quantization/deepgemm.py", line 58, in w8a8_block_fp8_matmul_deepgemm
fp8_gemm_nt((A, As), (B, Bs), C)
File "/opt/vllm/vllm/utils/deep_gemm.py", line 92, in fp8_gemm_nt
return _fp8_gemm_nt_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Failed: CUDA runtime error csrc/jit/kernel_runtime.hpp:108 '98'
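The last frames of the traceback call vLLM's fp8_gemm_nt wrapper, so the failure can be isolated without loading DeepSeek-R1 at all. Below is a minimal sketch (not from the issue) that drives that wrapper directly with dummy block-FP8 tensors; the sizes and the 128x128 block-scale layout are assumptions based on DeepSeek-style block quantization, and some DeepGEMM versions may additionally require a column-major/TMA-aligned layout for the activation scales, which would surface as a Python assertion rather than this CUDA error.

import torch
from vllm.utils.deep_gemm import fp8_gemm_nt  # wrapper seen in the traceback above

m, n, k = 128, 4096, 7168  # arbitrary sizes, multiples of the 128 block size (assumption)
A = torch.randn(m, k, device="cuda").to(torch.float8_e4m3fn)   # fp8 activations
B = torch.randn(n, k, device="cuda").to(torch.float8_e4m3fn)   # fp8 weight ("nt" layout: [n, k])
As = torch.ones(m, k // 128, device="cuda", dtype=torch.float32)         # per-token-group scales
Bs = torch.ones(n // 128, k // 128, device="cuda", dtype=torch.float32)  # per-block weight scales
C = torch.empty(m, n, device="cuda", dtype=torch.bfloat16)

# If the JIT-compiled kernel does not match the device architecture, this raises the
# same "CUDA runtime error ... '98'" as the traceback, without the multinode setup.
fp8_gemm_nt((A, As), (B, Bs), C)
torch.cuda.synchronize()
print("fp8_gemm_nt completed:", C.shape)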
To reproduce, build the container using the Dockerfile provided in the comment, and run the vLLM WideEP multinode example:
# node 1
VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 \
vllm serve deepseek-ai/DeepSeek-R1 \
--tensor-parallel-size 1 --enable-expert-parallel --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address <node1 ip address> \
--data-parallel-rpc-port 13345 --api-server-count=8 --gpu-memory-utilization 0.95 --max-model-len 10240 --enforce-eager
# node 2
VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 \
vllm serve deepseek-ai/DeepSeek-R1 \
--tensor-parallel-size 1 --enable-expert-parallel --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address <node1 ip address> \
--data-parallel-rpc-port 13345 --data-parallel-start-rank 8 --gpu-memory-utilization 0.95 --max-model-len 10240 --enforce-eager --headless
98 means cudaErrorInvalidDeviceFunction: the requested device function does not exist or is not compiled for the proper device architecture. You may use DG_JIT_DEBUG=1 to see whether the compiler arch flag (e.g., 90a for the Hopper series) is suitable for your own device.
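A quick sanity check along these lines (a sketch, not part of the original comment): compare the device's compute capability against the arch DeepGEMM should be targeting. Hopper-class GPUs (H100/H200/H20) report capability (9, 0), which corresponds to the 90a arch flag mentioned above; DG_JIT_DEBUG=1 must be set before the first GEMM so the JIT prints its compile flags.

import os
import torch

os.environ["DG_JIT_DEBUG"] = "1"  # make the DeepGEMM JIT print its compile flags
major, minor = torch.cuda.get_device_capability()
print(f"compute capability: {major}.{minor} ({torch.cuda.get_device_name()})")
if (major, minor) != (9, 0):
    print("Not an sm_90 device; a kernel built with the 90a arch flag would fail with error 98.")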
I encountered the same problem on H20-96G.
driver version: 565.57.01
nvcc version: CUDA 12.9