Fast CUTLASS GEMM from scratch

Step-by-step optimization of matrix multiplication (GEMM), implemented with the NVIDIA CUTLASS C++ template library.

Building

git submodule update --init --recursive
make

make bench

make test

The cuBLAS library is needed only to build the cuBLAS benchmark implementation. If it is not present, the rest of the code still compiles and runs.

Switching the cuBLAS call from NN (neither operand transposed) to NT (second operand transposed) gives a ~27% speedup in the original CUDA-MMM code.
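For readers unfamiliar with the BLAS naming, here is a minimal CPU sketch of what NN and NT mean for a column-major GEMM. It is not code from this repository: `gemm_ref` and `transpose` are hypothetical helpers written only for illustration. In cuBLAS, the same choice is made through the `transa`/`transb` arguments (`CUBLAS_OP_N`, `CUBLAS_OP_T`) of calls such as `cublasSgemm`. The sketch shows only that both variants compute the same product when B is stored transposed. It says nothing about speed, which depends on the GPU and on the memory access pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference GEMM (illustration only), column-major as in BLAS:
//   C (m x n) = A (m x k) * op(B) (k x n)
// transB == false ("NN"): B is stored as k x n (leading dimension k).
// transB == true  ("NT"): B is stored as n x k (leading dimension n)
//                         and is read transposed.
std::vector<float> gemm_ref(bool transB, std::size_t m, std::size_t n, std::size_t k,
                            const std::vector<float>& A, const std::vector<float>& B) {
    assert(A.size() == m * k && B.size() == k * n);
    std::vector<float> C(m * n, 0.0f);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                // op(B)(p, j): pick the element according to the storage layout.
                const float b = transB ? B[j + p * n] : B[p + j * k];
                acc += A[i + p * m] * b;
            }
            C[i + j * m] = acc;
        }
    }
    return C;
}

// Hypothetical helper: transpose of a column-major rows x cols matrix,
// returned as a column-major cols x rows matrix.
std::vector<float> transpose(const std::vector<float>& M, std::size_t rows, std::size_t cols) {
    assert(M.size() == rows * cols);
    std::vector<float> T(M.size());
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            T[c + r * cols] = M[r + c * rows];
    return T;
}
```

Calling `gemm_ref(false, m, n, k, A, B)` and `gemm_ref(true, m, n, k, A, transpose(B, k, n))` yields identical results. Only the layout of B in memory differs between the two.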

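Using half precision (FP16) instead of single precision (FP32) halves the size of every element, which gives up to 2x effective memory bandwidth and up to 2x compute throughput.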
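The memory side of that claim can be checked with a little arithmetic. The sketch below assumes an idealized GEMM in which A and B are each read once and C is written once. Real kernels move more data, so treat it as a lower bound. `gemm_min_bytes` is a hypothetical helper, not part of this repository.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (illustration only): lower bound on global-memory
// traffic for C (m x n) = A (m x k) * B (k x n), assuming A and B are each
// read exactly once and C is written exactly once.
std::uint64_t gemm_min_bytes(std::uint64_t m, std::uint64_t n, std::uint64_t k,
                             std::uint64_t elem_size) {
    assert(elem_size > 0);
    return (m * k + k * n + m * n) * elem_size;
}
```

For a 4096 x 4096 x 4096 problem, `gemm_min_bytes(4096, 4096, 4096, 4)` (FP32) is exactly twice `gemm_min_bytes(4096, 4096, 4096, 2)` (FP16), so the same bandwidth moves twice as many elements per second.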

Comparison of CUTLASS GEMM implementations

GNU Affero General Public License v3.0
