ysh329 / OpenCL-101

Learn OpenCL step by step.

Follow CPU GEMM optimization guide

ysh329 opened this issue

Optimization 0: naive implementation

taskset -c 0

$ taskset -c 0 ./matrixMultiplication 1024 1024 1024 ./opt.cl mat_mult_naive 0 10 $[1024] 1024 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        232.787749
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./opt.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 10 times ./opt.cl.mat_mult_naive starting ...
0        0.801968
>>> [INFO] skip first 1 time(s)
1        0.668630
2        0.677941
3        0.718801
4        0.808120
5        0.666343
6        0.787792
7        0.870304
8        0.713615
9        0.671040
10       0.802979
>>> [INFO] CL_GPU 1024x1024x1024 0.738556 s 2.907677 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

taskset -c 4

$ taskset -c 4 ./matrixMultiplication 1024 1024 1024 ./opt.cl mat_mult_naive 0 10 $[1024] 1024 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304

============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0        32.196788
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS

============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./opt.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 10 times ./opt.cl.mat_mult_naive starting ...
0        0.755578
>>> [INFO] skip first 1 time(s)
1        0.829920
2        0.667722
3        0.861194
4        0.828542
5        0.738687
6        0.698710
7        0.704991
8        0.705750
9        0.789136
10       0.859732
>>> [INFO] CL_GPU 1024x1024x1024 0.768438 s 2.794607 GFLOPS

>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b

OpenCL does not support declaring variables with the register keyword. Some people also say that register variables are not recommended. I followed the guide up to gemm_4x4_11 and stopped at the section on blocking and packing.