NVIDIA / MatX

An efficient C++17 GPU numerical computing library with Python-like syntax

Home Page: https://nvidia.github.io/MatX

[BUG] `matmul` does not support int32 tensors.

AtomicVar opened this issue

Describe the bug
It seems that the cuBLASLt-backed GEMM (matx::matmul) does not support int types, and it silently produces a wrong (all-zero) result instead of complaining about the unsupported type.

To Reproduce
Steps to reproduce the behavior:

  1. Write a simple GEMM example code that multiplies two int matrices:
#include "matx.h"
#include <cassert>
#include <cstdio>

using namespace matx;

int main() {
  MATX_ENTER_HANDLER();

  index_t M = 2;
  index_t N = 3;

  auto m = make_tensor<int>({M, N});
  auto v = make_tensor<int>({N, 1});

  m.SetVals({{1, 2, 3},
             {4, 5, 6}});
  v.SetVals({{1, 2, 3}});

  auto out = make_tensor<int>({M, 1});

  (out = matmul(m, v)).run();

  cudaStreamSynchronize(0);

  printf("m:\n");
  print(m);
  printf("v:\n");
  print(v);
  printf("out:\n");
  print(out);

  CUDA_CHECK_LAST_ERROR();
  MATX_EXIT_HANDLER();
}
  2. Build and run the code; you will get the following output with an all-zero result (cuBLASLt is the default backend):
m:
Tensor{int32_t} Rank: 2, Sizes:[2, 3], Strides:[3,1]
000000: 1 2 3 
000001: 4 5 6 
v:
Tensor{int32_t} Rank: 2, Sizes:[3, 1], Strides:[1,1]
000000: 1 
000001: 2 
000002: 3 
out:
Tensor{int32_t} Rank: 2, Sizes:[2, 1], Strides:[1,1]
000000: 0 
000001: 0

Expected behavior
The result should be [14, 32] (1*1 + 2*2 + 3*3 = 14 and 4*1 + 5*2 + 6*3 = 32), just like the float matmul:

m:
Tensor{float} Rank: 2, Sizes:[2, 3], Strides:[3,1]
000000: 1.0000e+00 2.0000e+00 3.0000e+00 
000001: 4.0000e+00 5.0000e+00 6.0000e+00 
v:
Tensor{float} Rank: 2, Sizes:[3, 1], Strides:[1,1]
000000: 1.0000e+00 
000001: 2.0000e+00 
000002: 3.0000e+00 
out:
Tensor{float} Rank: 2, Sizes:[2, 1], Strides:[1,1]
000000: 1.4000e+01 
000001: 3.2000e+01

Code snippets
Listed above.

System details (please complete the following information):

  • OS: Ubuntu 22.04
  • CUDA version: 11.8.89
  • g++ version: 11.4.0

Additional context
I also tried turning on CUTLASS (-DMATX_EN_CUTLASS=ON), and it supports int tensors after some minor fixes.

Currently in MatX, matmul_impl() uses cuBLASLt as the default GEMM provider and there is no easy way to switch to CUTLASS, so I changed the template parameter's default to PROVIDER_TYPE_CUTLASS:

template <typename TensorTypeC, typename TensorTypeA, typename TensorTypeB, 
-         MatXMatMulProvider_t PROV = PROVIDER_TYPE_CUBLASLT>
+         MatXMatMulProvider_t PROV = PROVIDER_TYPE_CUTLASS>
void matmul_impl(TensorTypeC C, const TensorTypeA A,
            const TensorTypeB B, cudaStream_t stream = 0,
            float alpha = 1.0, float beta = 0.0)
{

Then I changed this in matxMatMulHandle_t::MatMulLaunch to suppress the compiler's float-to-int conversion error:

typename CutlassGemm::Arguments args(
            {static_cast<int>(params_.m), static_cast<int>(params_.n),
             static_cast<int>(params_.k)}, // Gemm Problem dimensions
            {a.Data(),
             static_cast<int>(params_.lda)}, // Tensor-ref for source matrix A
            {b.Data(),
             static_cast<int>(params_.ldb)}, // Tensor-ref for source matrix B
            {c.Data(),
             static_cast<int>(params_.ldc)}, // Tensor-ref for source matrix C
            {c.Data(),
             static_cast<int>(
                 params_.ldc)}, // Tensor-ref for destination matrix D (may be
                                // different memory than source C matrix)
-           {alpha, beta});     // Scalars used in the Epilogue
+           {static_cast<T1>(alpha), static_cast<T1>(beta)});     // Scalars used in the Epilogue

Now the output is correct:

m:
Tensor{int32_t} Rank: 2, Sizes:[2, 3], Strides:[3,1]
000000: 1 2 3 
000001: 4 5 6 
v:
Tensor{int32_t} Rank: 2, Sizes:[3, 1], Strides:[1,1]
000000: 1 
000001: 2 
000002: 3 
out:
Tensor{int32_t} Rank: 2, Sizes:[2, 1], Strides:[1,1]
000000: 14 
000001: 32

My suggestion
If supporting int32 GEMM is not necessary, we should throw an error telling the user that it is unsupported, and suggest either enabling the CUTLASS backend or using float32 instead.
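
A minimal sketch of what such a guard could look like is below (illustrative only, not the actual MatX implementation; the value_type alias, the use of static_assert, and the exact message wording are assumptions):

template <typename TensorTypeC, typename TensorTypeA, typename TensorTypeB,
          MatXMatMulProvider_t PROV = PROVIDER_TYPE_CUBLASLT>
void matmul_impl(TensorTypeC C, const TensorTypeA A,
                 const TensorTypeB B, cudaStream_t stream = 0,
                 float alpha = 1.0, float beta = 0.0)
{
  if constexpr (PROV == PROVIDER_TYPE_CUBLASLT) {
    // Illustrative guard (assumed names): reject integer element types up
    // front when the cuBLASLt provider is selected, instead of silently
    // returning zeros. Requires <type_traits>.
    static_assert(!std::is_integral_v<typename TensorTypeA::value_type>,
                  "integer matmul is not supported by the cuBLASLt provider; "
                  "enable CUTLASS (-DMATX_EN_CUTLASS=ON) or use float32 tensors");
  }
  // ... existing dispatch to the selected GEMM provider, unchanged ...
}

A compile-time check like this (or an equivalent runtime error) would turn the silent all-zero result into an actionable message.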

Hi @AtomicVar, thanks for the report. You're right, cuBLAS doesn't support it. The reason we disabled CUTLASS is that its compile times in the unit tests were extremely long: even though the functionality and speed were both adequate, the builds were taking over 40 minutes. We can add a message for this saying it's an unsupported type.

@AtomicVar I've submitted MR #540 to resolve this

@AtomicVar according to the cuBLAS team, we're not meeting the requirements here: https://docs.nvidia.com/cuda/cublas/index.html#cublasltmatmul

Namely, A must be transposed.

If that still works for you, we can add support for it when those requirements are met.

@cliffburdick Actually, I can use a float matmul instead of an int matmul to accomplish the same task, so it isn't a problem if int is not supported. I just think the unsupported cases should be documented, and clear error messages should be thrown and printed.
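
For reference, a minimal sketch of that float workaround, mirroring the repro above with only the element type changed (same shapes and values; the expected output is 14 and 32):

#include "matx.h"
#include <cstdio>

using namespace matx;

int main() {
  MATX_ENTER_HANDLER();

  index_t M = 2;
  index_t N = 3;

  // Same data as the int repro, stored as float so the default
  // cuBLASLt-backed matmul path is supported.
  auto m = make_tensor<float>({M, N});
  auto v = make_tensor<float>({N, 1});

  m.SetVals({{1.f, 2.f, 3.f},
             {4.f, 5.f, 6.f}});
  v.SetVals({{1.f, 2.f, 3.f}});

  auto out = make_tensor<float>({M, 1});

  (out = matmul(m, v)).run();
  cudaStreamSynchronize(0);

  printf("out:\n");
  print(out);  // expected: 1.4000e+01, 3.2000e+01

  CUDA_CHECK_LAST_ERROR();
  MATX_EXIT_HANDLER();
}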