NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

NVIDIA/cutlass Issues

[QST] How to run GEMM with CUDA Graph?
Closed 3 days ago
[BUG] Failing to build on MSVC due to call to _div128
Updated 3 days ago · 2 comments
[QST] Tiling an MMA in the K dimension
Closed 4 days ago · 3 comments
[BUG] Circular Dependency in Header Files
Updated 5 days ago
[QST/BUG] Should shared memory usage be checked for multistage pipeline?
Updated 5 days ago · 1 comment
[DOC] Incorrect link in main README file
Updated 6 days ago
[QST/BUG] Why does a CuTe kernel transfer so much more data between L2 and gmem than a cuBLAS kernel?
Updated 6 days ago · 6 comments
[QST] What is the difference between `WmmaTensorOp` and `TensorOp`?
Updated 6 days ago
Int8 multiplication with pytorch extension: namespace "torch" has no member "I8
Updated 6 days ago
How to perform operations like crop, concat on tensors in CuTe? [QST]
Updated 7 days ago · 2 comments
[QST] GEMM Epilogue Fusion: Row-wise and Column-wise Multiplication
Updated 7 days ago · 1 comment
[QST] Is there grouped_gemv
Updated 7 days ago
[QST] Equality of shapes
Closed 7 days ago · 1 comment
[QST] Best way to tell which methods are called
Closed 8 days ago · 1 comment
[QST] CUTLASS kernels appear to be significantly slower than cuBLAS for an fp16 GEMM on `sm_75`
Closed 9 days ago · 4 comments
[FEA] Add cuTensorMapEncodeTiled to CudaHostAdapter
Updated 10 days ago
[QST] GEMM Epilogue Fusion: Element-wise Ops and Two-Tensor Element-wise Multiplication
Updated 11 days ago · 7 comments
[QST] Why does the fp8 conversion only have a float2fp8 function without PTX?
Updated 11 days ago · 1 comment
[BUG] Composition between `Tensor` and `Layout` as shown in `03_tensor.md` does not compile
Updated 11 days ago · 2 comments
[QST] Epilogue Reduction
Updated 11 days ago · 1 comment
[BUG] CUTLASS Python API silently fails in (suspected) unsupported case
Updated 11 days ago · 1 comment
[QST] (BUG?) The stride of TensorNCxHWx seems confusing when C is smaller than Interleave
Updated 12 days ago
Tiled copy misaligned, how to solve it?
Updated 12 days ago
[QST] use FastLinearCombinationClamp to convert half accumulator to int8_t output?
Updated 12 days ago · 1 comment
Warp Group MMA vs Warp MMA
Updated 12 days ago · 1 comment
[BUG] print_layout seems to change the swizzle mode of its input
Closed 12 days ago · 5 comments
[QST] `TiledMMA` with tiling over a batch dimension?
Closed 12 days ago · 1 comment
[QST] How to use different types for D0 (D1) and D2 based on the 45_dual_gemm example
Updated 13 days ago
two files are included in each other
Updated 13 days ago · 1 comment
typo in comment
Updated 13 days ago · 1 comment
close
Closed 14 days ago · 2 comments
[QST] TiledCopy using `cp.async` as the CopyAtom fails with Layouts created from runtime values.
Closed 14 days ago · 2 comments
[QST] The best way to do D = func(A x B) x C.
Updated 15 days ago
[QST] The best way to get the original coordinate from a fragment with CuTe?
Closed 15 days ago · 5 comments
[QST] epilogue in HGEMM
Updated 15 days ago
[QST] Hopper mixed precision gemm always worse than FP8
Updated 15 days ago · 6 comments
[QST] Is there a "DefaultCopy" for transposed int8 matrix in CuTe?
Closed 16 days ago
[QST] In a TiledMMA, why can't C and D be in smem?
Closed 17 days ago · 4 comments
[QST] 128x32 Tiled Copy with 256 Threads
Closed 18 days ago · 4 comments
[QST] Is it right to read a shared memory tensor directly?
Closed 19 days ago · 6 comments
[QST] Performance overhead of indexing into a swizzled tensor
Closed 19 days ago · 5 comments
[QST] Row major for int8 matrix multiplications?
Updated a month ago · 1 comment
[QST] Group GEMM with Split-K
Closed a month ago · 2 comments
[QST] Is it possible to compose two `Swizzle`s with different shift bits?
Closed a month ago · 2 comments
[QST] `cutlass::Array` and `cute::Tensor` --- using CUTLASS utility structs/classes with CUTE (such as `NumericArrayConverter`)
Updated a month ago
[BUG] FP8 warp specialized gemm failed when m is small
Closed a month ago · 3 comments
[QST] Understanding the layouts around tensor core ops
Closed a month ago · 1 comment
[DOC] Atom SM90_64x128x16_F16F16F16F16_TN
Closed a month ago · 1 comment
[BUG] `cutlass::half_t`'s max value seems to be 2048
Closed a month ago · 2 comments
[QST] sm70 and sm80 CuTe examples are tiling ordinary float multiplication?
Closed a month ago · 1 comment