Several common matrix multiplication methods are implemented on CPU and NVIDIA GPU using C++11 and CUDA, and the performance of each optimization method is measured with a simple benchmark.

CPU:
- naive
- reordering
- tiling
- strassen
- coppersmith-winograd

GPU:
- cublas
- naive
- kahan
- shared_memory
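
To make the first three CPU entries concrete, here is a minimal C++ sketch of what "naive", "reordering", and "tiling" usually mean for a row-major square matrix. The function names, the flat `std::vector<float>` layout, and the default tile size of 32 are assumptions made for illustration; they are not taken from the repository's sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major n x n matrix in a flat vector (illustrative layout, not the repository's).
using Matrix = std::vector<float>;

// Naive i-j-k order: the inner loop reads B down a column, i.e. with stride n.
void matmul_naive(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

// Reordered i-k-j: the inner loop now reads B and writes C along rows (unit stride).
void matmul_reordered(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    std::fill(c.begin(), c.end(), 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const float a_ik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += a_ik * b[k * n + j];
            }
        }
    }
}

// Tiled: the same i-k-j update, applied block by block so that the working set
// of each block has a better chance of staying in cache.
void matmul_tiled(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n,
                  std::size_t tile = 32) {
    assert(tile > 0);
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    std::fill(c.begin(), c.end(), 0.0f);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        const std::size_t i_end = std::min(ii + tile, n);
        for (std::size_t kk = 0; kk < n; kk += tile) {
            const std::size_t k_end = std::min(kk + tile, n);
            for (std::size_t jj = 0; jj < n; jj += tile) {
                const std::size_t j_end = std::min(jj + tile, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const float a_ik = a[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c[i * n + j] += a_ik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

All three functions compute the same product; they differ only in the order in which memory is touched. The best tile size depends on the cache hierarchy of the machine, so the value 32 should be treated as a starting point for tuning rather than a recommendation.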
- OS: Linux
- CMake Version: >= 3.8
- GCC Version: >= 4.8
- CUDA Version: 11.4 (recommended)
- CUDA Driver Version: 470.129.06 (recommended)
git clone https://github.com/Bruce-Lee-LY/matrix_multiply.git
cd matrix_multiply
./build.sh -t Release -b OFF
./build.sh -t Debug -b ON
- OS: Ubuntu 20.04.4
- CPU: i5-9400F
- GPU: NVIDIA GeForce GTX 1080 Ti
- CUDA Version: 11.4
- CUDA Driver Version: 470.129.06
- Matrix (float): A (512 * 512) * B (512 * 512) = C (512 * 512)

CPU:

| Method | Cost / ms |
| --- | --- |
| naive | 1238.647 |
| reordering | 984.445 |
| tiling | 1000.095 |
| strassen | 57429.407 |
| coppersmith-winograd | 77668.238 |

GPU:

| Method | Cost / ms |
| --- | --- |
| cublas | 0.100 |
| naive | 0.613 |
| kahan | 0.616 |
| shared_memory | 0.153 |
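
The "kahan" entry presumably refers to Kahan (compensated) summation applied to the inner product that produces each element of C. The sketch below shows the technique on the CPU in plain C++ for a single dot product. It is an assumption about what the method name means, not a copy of the repository's CUDA kernel, and `dot_naive` and `dot_kahan` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain accumulation: low-order bits of small terms are lost once the sum is large.
float dot_naive(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

// Kahan summation: carry the rounding error of each addition in `comp`
// and feed it back into the next term.
float dot_kahan(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    float sum = 0.0f;
    float comp = 0.0f;  // running compensation for lost low-order bits
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float term = x[i] * y[i] - comp;  // apply the previous correction
        const float next = sum + term;          // low-order bits of term may be lost here
        comp = (next - sum) - term;             // recover what was lost
        sum = next;
    }
    return sum;
}
```

The compensation step relies on the compiler keeping the floating-point operations in the order written, so building with options such as `-ffast-math` can optimize it away.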