NVIDIA / MatX

An efficient C++17 GPU numerical computing library with Python-like syntax

Home Page: https://nvidia.github.io/MatX

[BUG] `matmul` does not support int32 tensors.

AtomicVar opened this issue

Describe the bug
It seems that the cuBLASLt-backed GEMM (matx::matmul) does not support int types, and it silently produces a wrong (all-zero) result instead of complaining about the unsupported type.

To Reproduce
Steps to reproduce the behavior:

  1. Write a simple GEMM example code that multiplies two int matrices:
#include "matx.h"
#include <cassert>
#include <cstdio>

using namespace matx;

int main() {
  MATX_ENTER_HANDLER();

  index_t M = 2;
  index_t N = 3;

  auto m = make_tensor<int>({M, N});
  auto v = make_tensor<int>({N, 1});

  m.SetVals({{1, 2, 3},
             {4, 5, 6}});
  v.SetVals({{1, 2, 3}});

  auto out = make_tensor<int>({M, 1});

  (out = matmul(m, v)).run();

  cudaStreamSynchronize(0);

  printf("m:\n");
  print(m);
  printf("v:\n");
  print(v);
  printf("out:\n");
  print(out);

  CUDA_CHECK_LAST_ERROR();
  MATX_EXIT_HANDLER();
}
  2. Build and run the code; you will get the following output with an all-zero result (cuBLASLt is the default backend):
m:
Tensor{int32_t} Rank: 2, Sizes:[2, 3], Strides:[3,1]
000000: 1 2 3 
000001: 4 5 6 
v:
Tensor{int32_t} Rank: 2, Sizes:[3, 1], Strides:[1,1]
000000: 1 
000001: 2 
000002: 3 
out:
Tensor{int32_t} Rank: 2, Sizes:[2, 1], Strides:[1,1]
000000: 0 
000001: 0

Expected behavior
The result should be [14, 32] (1*1 + 2*2 + 3*3 = 14 and 4*1 + 5*2 + 6*3 = 32), just like the float matmul:

m:
Tensor{float} Rank: 2, Sizes:[2, 3], Strides:[3,1]
000000: 1.0000e+00 2.0000e+00 3.0000e+00 
000001: 4.0000e+00 5.0000e+00 6.0000e+00 
v:
Tensor{float} Rank: 2, Sizes:[3, 1], Strides:[1,1]
000000: 1.0000e+00 
000001: 2.0000e+00 
000002: 3.0000e+00 
out:
Tensor{float} Rank: 2, Sizes:[2, 1], Strides:[1,1]
000000: 1.4000e+01 
000001: 3.2000e+01

Code snippets
Listed above.

System details (please complete the following information):

  • OS: Ubuntu 22.04
  • CUDA version: 11.8.89
  • g++ version: 11.4.0

Additional context
I also tried turning on CUTLASS (-DMATX_EN_CUTLASS=ON), and it supports int tensors after some minor fixes.

Currently in MatX, matmul_impl() uses cuBLASLt as the default GEMM provider and there is no easy way to switch to CUTLASS, so I changed the template parameter's default to PROVIDER_TYPE_CUTLASS:

template <typename TensorTypeC, typename TensorTypeA, typename TensorTypeB, 
-         MatXMatMulProvider_t PROV = PROVIDER_TYPE_CUBLASLT>
+         MatXMatMulProvider_t PROV = PROVIDER_TYPE_CUTLASS>
void matmul_impl(TensorTypeC C, const TensorTypeA A,
            const TensorTypeB B, cudaStream_t stream = 0,
            float alpha = 1.0, float beta = 0.0)
{

Then I changed this in matxMatMulHandle_t::MatMulLaunch to suppress the compiler's float-to-int conversion error:

typename CutlassGemm::Arguments args(
            {static_cast<int>(params_.m), static_cast<int>(params_.n),
             static_cast<int>(params_.k)}, // Gemm Problem dimensions
            {a.Data(),
             static_cast<int>(params_.lda)}, // Tensor-ref for source matrix A
            {b.Data(),
             static_cast<int>(params_.ldb)}, // Tensor-ref for source matrix B
            {c.Data(),
             static_cast<int>(params_.ldc)}, // Tensor-ref for source matrix C
            {c.Data(),
             static_cast<int>(
                 params_.ldc)}, // Tensor-ref for destination matrix D (may be
                                // different memory than source C matrix)
-           {alpha, beta});     // Scalars used in the Epilogue
+           {static_cast<T1>(alpha), static_cast<T1>(beta)});     // Scalars used in the Epilogue

Now the output is correct:

m:
Tensor{int32_t} Rank: 2, Sizes:[2, 3], Strides:[3,1]
000000: 1 2 3 
000001: 4 5 6 
v:
Tensor{int32_t} Rank: 2, Sizes:[3, 1], Strides:[1,1]
000000: 1 
000001: 2 
000002: 3 
out:
Tensor{int32_t} Rank: 2, Sizes:[2, 1], Strides:[1,1]
000000: 14 
000001: 32

My suggestion
If supporting int32 GEMM is not necessary, we should throw an error telling the user that it is unsupported, and suggest either enabling the CUTLASS backend or using float32 instead.
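
A minimal sketch of what such a guard could look like is below (illustrative only, not the actual MatX implementation; the value_type alias, the use of static_assert, and the exact message wording are assumptions):

template <typename TensorTypeC, typename TensorTypeA, typename TensorTypeB,
          MatXMatMulProvider_t PROV = PROVIDER_TYPE_CUBLASLT>
void matmul_impl(TensorTypeC C, const TensorTypeA A,
                 const TensorTypeB B, cudaStream_t stream = 0,
                 float alpha = 1.0, float beta = 0.0)
{
  if constexpr (PROV == PROVIDER_TYPE_CUBLASLT) {
    // Illustrative guard (assumed names): reject integer element types up
    // front when the cuBLASLt provider is selected, instead of silently
    // returning zeros. Requires <type_traits>.
    static_assert(!std::is_integral_v<typename TensorTypeA::value_type>,
                  "integer matmul is not supported by the cuBLASLt provider; "
                  "enable CUTLASS (-DMATX_EN_CUTLASS=ON) or use float32 tensors");
  }
  // ... existing dispatch to the selected GEMM provider, unchanged ...
}

A compile-time check like this (or an equivalent runtime error) would turn the silent all-zero result into an actionable message.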

Hi @AtomicVar, thanks for the report. You're right, cuBLAS doesn't support it. The reason we disabled CUTLASS is that its compile times in the unit tests were extremely long: even though the functionality and speed were both adequate, the builds were taking over 40 minutes. We can add a message for this saying it's an unsupported type.

@AtomicVar I've submitted MR #540 to resolve this

@AtomicVar according to the cuBLAS team, we're not meeting the requirements here: https://docs.nvidia.com/cuda/cublas/index.html#cublasltmatmul

Namely, A must be transposed.

If that still works for you, we can add support for it when those requirements are met.

@cliffburdick Actually, I can use a float matmul instead of an int matmul to accomplish the same task, so it isn't a problem if int is not supported. I just think the unsupported cases should be documented, and clear error messages should be thrown and printed.
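
For reference, a minimal sketch of that float workaround, mirroring the repro above with only the element type changed (same shapes and values; the expected output is 14 and 32):

#include "matx.h"
#include <cstdio>

using namespace matx;

int main() {
  MATX_ENTER_HANDLER();

  index_t M = 2;
  index_t N = 3;

  // Same data as the int repro, stored as float so the default
  // cuBLASLt-backed matmul path is supported.
  auto m = make_tensor<float>({M, N});
  auto v = make_tensor<float>({N, 1});

  m.SetVals({{1.f, 2.f, 3.f},
             {4.f, 5.f, 6.f}});
  v.SetVals({{1.f, 2.f, 3.f}});

  auto out = make_tensor<float>({M, 1});

  (out = matmul(m, v)).run();
  cudaStreamSynchronize(0);

  printf("out:\n");
  print(out);  // expected: 1.4000e+01, 3.2000e+01

  CUDA_CHECK_LAST_ERROR();
  MATX_EXIT_HANDLER();
}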