Triton Error [CUDA]: device kernel image is invalid

Question

Triton Error [CUDA]: device kernel image is invalid

zhyncs opened this issue 14 days ago · comments

Hi @ptillet @Jokeren @ThomasRaoux @jlebar

env

sys.platform: linux
Python: 3.9.16 (main, Aug 15 2023, 19:38:56) [GCC 8.3.1 20190311 (Red Hat 8.3.1-3)]
CUDA available: True
MUSA available: False
GPU 0,1: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: gcc (GCC) 10.2.1 20210130 (Red Hat 10.2.1-11)
PyTorch: 2.2.2+cu118
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2  (built against CUDA 12.1)
    - Built with CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

transformers: 4.40.0
pydantic: 2.6.0
triton: 2.2.0

reproduce

import torch
import triton
import triton.language as tl


@triton.jit
def _add_kernel(A, B, C, size, BLOCK: tl.constexpr):
    prog_id = tl.program_id(0)
    offs = prog_id * BLOCK + tl.arange(0, BLOCK)
    a = tl.load(A + offs, mask=offs < size)
    b = tl.load(B + offs, mask=offs < size)
    tl.store(C + offs, a + b, mask=offs < size)


def custom_add(a, b):
    c = torch.empty_like(a)
    size = c.size(0)
    BLOCK = 16

    grid = [triton.cdiv(size, BLOCK)]
    _add_kernel[grid](a, b, c, size, BLOCK=BLOCK)
    return c


def check_env_triton():
    try:
        a = torch.tensor([1, 2], device='cuda')
        b = a.new_tensor([3, 4], device='cuda')
        c = custom_add(a, b)
    except Exception as e:
        print(e)


check_env_triton()

error

Triton Error [CUDA]: device kernel image is invalid

If you need more detailed information, please feel free to contact me at any time. Thanks.

zhyncs · Answer 1 · Thu Jun 20 2024 13:08:04 GMT+0800 (China Standard Time)

ref InternLM/lmdeploy#1621 (comment)