Follow CPU GEMM optimization guide
ysh329 opened this issue
ysh329 commented
Optimization 0: naive implementation
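For reference, a naive GEMM is the plain triple loop with no blocking, packing, or vectorization. The source of mat_mult_naive is not shown in this issue, so the following is only a minimal C sketch of the same idea. The function name gemm_naive and the row-major layout are assumptions. In an OpenCL version, the two outer loops would presumably be replaced by the 2-D global work size reported in the logs (1024 x 1024), with one work-item per element of C.

```c
#include <stddef.h>

/* Hypothetical reference implementation: C = A * B, all matrices row-major.
 * A is m x k, B is k x n, C is m x n. */
void gemm_naive(size_t m, size_t n, size_t k,
                const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;                 /* running dot product for C[i][j] */
            for (size_t p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}
```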
taskset -c 0
$ taskset -c 0 ./matrixMultiplication 1024 1024 1024 ./opt.cl mat_mult_naive 0 10 $[1024] 1024 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 232.787749
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./opt.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 10 times ./opt.cl.mat_mult_naive starting ...
0 0.801968
>>> [INFO] skip first 1 time(s)
1 0.668630
2 0.677941
3 0.718801
4 0.808120
5 0.666343
6 0.787792
7 0.870304
8 0.713615
9 0.671040
10 0.802979
>>> [INFO] CL_GPU 1024x1024x1024 0.738556 s 2.907677 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
taskset -c 4
$ taskset -c 4 ./matrixMultiplication 1024 1024 1024 ./opt.cl mat_mult_naive 0 10 $[1024] 1024 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 32.196788
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./opt.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 10 times ./opt.cl.mat_mult_naive starting ...
0 0.755578
>>> [INFO] skip first 1 time(s)
1 0.829920
2 0.667722
3 0.861194
4 0.828542
5 0.738687
6 0.698710
7 0.704991
8 0.705750
9 0.789136
10 0.859732
>>> [INFO] CL_GPU 1024x1024x1024 0.768438 s 2.794607 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
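The summary lines in both runs are consistent with averaging the timed runs after dropping the warm-up run, then dividing 2*M*N*K floating-point operations by that mean time. The benchmark's own source is not shown here, so the C sketch below is an assumption about the calculation, and the names mean_time_s and gemm_gflops are hypothetical. It would also explain the "nan s nan GFLOPS" on the CPU lines: with a single CPU run and the first run skipped, no samples remain and the mean becomes 0.0 / 0.0.

```c
#include <stddef.h>

/* Mean of times[skip..count-1]; the first `skip` warm-up runs are excluded.
 * If no runs remain, this evaluates 0.0 / 0.0, which is NaN under IEEE 754. */
double mean_time_s(const double *times, size_t count, size_t skip)
{
    double sum = 0.0;
    size_t used = 0;
    for (size_t i = skip; i < count; ++i) {
        sum += times[i];
        ++used;
    }
    return sum / (double)used;
}

/* GEMM does one multiply and one add per (i, j, p) triple: 2*m*n*k FLOPs. */
double gemm_gflops(size_t m, size_t n, size_t k, double seconds)
{
    return 2.0 * (double)m * (double)n * (double)k / seconds / 1e9;
}
```

Fed with the eleven GPU timings from the taskset -c 4 run and skip = 1, this gives a mean of about 0.768438 s and about 2.7946 GFLOPS, matching the logged summary.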
ysh329 commented
OpenCL C does not support the register storage-class specifier, so the guide's register-variable steps cannot be ported as written. Some people also say that register variables are not recommended anyway. I followed the guide up to gemm_4x4_11 and stopped at the section on blocking and packing.
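The register-blocking idea itself does not depend on the keyword: ordinary local variables in a C function (or private variables in an OpenCL kernel) are candidates for registers, and the compiler makes the final allocation decision. The sketch below shows one way a 4x4 block could be written in plain C under that assumption. gemm_4x4_block is a hypothetical name, not code from the guide or from this repository, and it assumes row-major matrices with m and n divisible by 4.

```c
#include <stddef.h>

/* Hypothetical 4x4 blocked GEMM: C = A * B, row-major.
 * A is m x k, B is k x n, C is m x n; m and n must be multiples of 4. */
void gemm_4x4_block(size_t m, size_t n, size_t k,
                    const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < m; i += 4) {
        for (size_t j = 0; j < n; j += 4) {
            /* 16 accumulators as plain locals; no `register` keyword needed. */
            float acc[4][4] = {{0.0f}};

            for (size_t p = 0; p < k; ++p) {
                /* Load 4 values of A and point at 4 values of B once,
                 * then reuse each of them 4 times. */
                float av[4];
                for (size_t r = 0; r < 4; ++r) {
                    av[r] = a[(i + r) * k + p];
                }
                const float *bp = &b[p * n + j];

                for (size_t r = 0; r < 4; ++r) {
                    for (size_t s = 0; s < 4; ++s) {
                        acc[r][s] += av[r] * bp[s];
                    }
                }
            }

            /* Write the finished 4x4 block back to C. */
            for (size_t r = 0; r < 4; ++r) {
                for (size_t s = 0; s < 4; ++s) {
                    c[(i + r) * n + (j + s)] = acc[r][s];
                }
            }
        }
    }
}
```

Whether the accumulators actually stay in registers depends on the compiler and target, so it is worth checking the generated code or simply measuring.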