NVIDIA/cutlass
CUDA Templates for Linear Algebra Subroutines
Stargazers: 4740
Watchers: 107
Issues: 898
Forks: 820
NVIDIA/cutlass Issues
[QST] How to run GEMM with CUDA Graph?
Closed
3 days ago
[BUG] Failing to build on MSVC due to call to _div128
Updated
3 days ago
Comments count
2
[QST] Tiling an MMA in the K dimension
Closed
4 days ago
Comments count
3
[BUG] Circular Dependency in Header Files
Updated
5 days ago
[QST/BUG] Should shared memory usage be checked for multistage pipeline?
Updated
5 days ago
Comments count
1
[DOC] Incorrect link in main README file
Updated
6 days ago
[QST/BUG] Why does the CuTe kernel transfer so much more data between L2 and gmem than the cuBLAS kernel?
Updated
6 days ago
Comments count
6
[QST] What is the difference between `WmmaTensorOp` and `TensorOp`?
Updated
6 days ago
Int8 multiplication with pytorch extension: namespace "torch" has no member "I8
Updated
6 days ago
How to perform operations like crop, concat on tensors in CuTe? [QST]
Updated
7 days ago
Comments count
2
[QST] GEMM Epilogue Fusion: Row-wise and Column-wise Multiplication
Updated
7 days ago
Comments count
1
[QST] Is there grouped_gemv
Updated
7 days ago
[QST] Equality of shapes
Closed
7 days ago
Comments count
1
[QST] Best way to tell which methods are called
Closed
8 days ago
Comments count
1
[QST] CUTLASS kernels appear to be significantly slower than CuBLAS for an fp16 gemm on `sm_75`
Closed
9 days ago
Comments count
4
[FEA] Add cuTensorMapEncodeTiled to CudaHostAdapter
Updated
10 days ago
[QST] GEMM Epilogue Fusion: Element-wise Ops and Two-Tensor Element-wise Multiplication
Updated
11 days ago
Comments count
7
[QST] Why does the fp8 conversion only have a float2fp8 function without PTX?
Updated
11 days ago
Comments count
1
[BUG] Composition between `Tensor` and `Layout` as shown in `03_tensor.md` does not compile
Updated
11 days ago
Comments count
2
[QST] Epilogue Reduction
Updated
11 days ago
Comments count
1
[BUG] Cutlass Python API silently fails in (suspected) unsupported case
Updated
11 days ago
Comments count
1
[QST] (BUG?) The stride of TensorNCxHWx seems confusing when C is smaller than Interleave
Updated
12 days ago
Tiled copy misaligned, how to solve it?
Updated
12 days ago
[QST] use FastLinearCombinationClamp to convert half accumulator to int8_t output?
Updated
12 days ago
Comments count
1
Warp Group MMA vs Warp MMA
Updated
12 days ago
Comments count
1
[BUG] print_layout seems to change the swizzle mode of its input
Closed
12 days ago
Comments count
5
[QST] `TiledMMA` with tiling over a batch dimension?
Closed
12 days ago
Comments count
1
[QST] How to implement different types for D0 (D1) and D2 based on the 45_dual_gemm example
Updated
13 days ago
two files are included in each other
Updated
13 days ago
Comments count
1
typo in comment
Updated
13 days ago
Comments count
1
close
Closed
14 days ago
Comments count
2
[QST] TiledCopy using `cp.async` as the CopyAtom fails with Layouts created from runtime values.
Closed
14 days ago
Comments count
2
[QST] The best way to do D = func(A x B) x C.
Updated
15 days ago
[QST] The best way to get the origin coord from a fragment with CuTe?
Closed
15 days ago
Comments count
5
[QST] epilogue in HGEMM
Updated
15 days ago
[QST] Hopper mixed precision gemm always worse than FP8
Updated
15 days ago
Comments count
6
[QST] Is there a "DefaultCopy" for transposed int8 matrix in CuTe?
Closed
16 days ago
[QST] In a TiledMMA, why can't C and D be in smem?
Closed
17 days ago
Comments count
4
[QST] 128x32 Tiled Copy with 256 Threads
Closed
18 days ago
Comments count
4
[QST] Is it right to read a shared memory tensor directly?
Closed
19 days ago
Comments count
6
[QST] performance overhead of indexing into a swizzled tensor
Closed
19 days ago
Comments count
5
[QST] Row major for int8 matrix multiplications?
Updated
a month ago
Comments count
1
[QST] Group GEMM with Split-K
Closed
a month ago
Comments count
2
[QST] Is it possible to compose two `Swizzle`s with different shift bits?
Closed
a month ago
Comments count
2
[QST] `cutlass::Array` and `cute::Tensor` --- using CUTLASS utility structs/classes with CUTE (such as `NumericArrayConverter`)
Updated
a month ago
[BUG] FP8 warp specialized gemm failed when m is small
Closed
a month ago
Comments count
3
[QST] Understanding the layouts around tensor core ops
Closed
a month ago
Comments count
1
[DOC] Atom SM90_64x128x16_F16F16F16F16_TN
Closed
a month ago
Comments count
1
[BUG] cutlass::half_t's max value seems to be 2048
Closed
a month ago
Comments count
2
[QST] sm70 and sm80 CuTe examples are tiling ordinary float multiplication?
Closed
a month ago
Comments count
1