Benchmark for various types (floatN, intN, halfN, doubleN, shortN) using a naive implementation
ysh329 opened this issue · comments
NV Card
Temporary benchmark on an NV card:
- Machine: linux-W580-G20, x86_64-linux-gnu
- Device: TITAN X (Pascal), 12189MiB
- M: 1024; N: 1024; K: 1024
Naive Kernel
- float; 1024 1024 1024; 10; 353.967207 GFLOPS
- int; 1024 1024 1024; 10; 356.647841 GFLOPS
__kernel void mat_mult_naive(const int M, const int N, const int K,
                             __global const CL_INPUT_TYPE *a,
                             __global const CL_INPUT_TYPE *b,
                             __global CL_INPUT_TYPE *c) {
    // One work-item per output element; a is M x K, b is K x N, c is M x N (row-major).
    const int col = get_global_id(0);
    const int row = get_global_id(1);
    CL_ELEM_TYPE res = 0;
    for (int p = 0; p < K; p++) {
        // a has K columns, so its row stride is K.
        res += a[row * K + p] * b[p * N + col];
    }
    c[row * N + col] = res;
}
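For checking the GPU output, a CPU reference can be sketched as below. This is a minimal sketch, not the benchmark's actual host code: the function names `mat_mult_cpu` and `correct_rate` are hypothetical, `ELEM_TYPE` is borrowed from the log output and assumed to be a compile-time macro, and the matrices are assumed to be row-major with `a` of size M x K, `b` of size K x N and `c` of size M x N.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the element type is chosen at compile time, as the logs suggest. */
#ifndef ELEM_TYPE
#define ELEM_TYPE float
#endif

/* Naive row-major reference: a is M x K, b is K x N, c is M x N. */
void mat_mult_cpu(int M, int N, int K,
                  const ELEM_TYPE *a, const ELEM_TYPE *b, ELEM_TYPE *c) {
    assert(M > 0 && N > 0 && K > 0);
    for (int row = 0; row < M; row++) {
        for (int col = 0; col < N; col++) {
            ELEM_TYPE res = 0;
            for (int p = 0; p < K; p++) {
                res += a[(size_t)row * K + p] * b[(size_t)p * N + col];
            }
            c[(size_t)row * N + col] = res;
        }
    }
}

/* Fraction of elements that match exactly (hypothetical analogue of the
 * "correct rate" line in the logs). */
double correct_rate(const ELEM_TYPE *x, const ELEM_TYPE *y, size_t len) {
    assert(len > 0);
    size_t same = 0;
    for (size_t i = 0; i < len; i++) {
        if (x[i] == y[i]) same++;
    }
    return (double)same / (double)len;
}
```

Testing with a non-square shape (for example M = 2, N = 2, K = 3) is a cheap way to catch a wrong row stride, which a 1024 x 1024 x 1024 run cannot reveal.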
Mali-T860
Before benchmarking, set performance mode first:
$ sudo ../tools/ck-print-gpu-freq
*** Current GPU frequency:
800000000
*** Min frequency:
200000000
*** Max frequency:
800000000
*** Available GPU frequencies:
200000000 297000000 400000000 500000000 594000000 800000000
*** Current GPU governor:
performance
*** Available GPU governor:
userspace powersave performance simple_ondemand
Naive kernel
1024 x 1024 x 1024
| type | execution time (s) | GFLOPS |
|---|---|---|
| int (32-bit) | 0.5871 | 3.657402 |
| float (32-bit) | 0.5907 | 3.635007 |
| double (64-bit) | 0.7542 | 2.847346 |
| half (16-bit) | 0.5055 | 4.247821 |
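The GFLOPS column is consistent with counting 2 x M x N x K operations (one multiply and one add per inner-loop step) and dividing by the average time. The sketch below assumes that convention; the function name `gflops` is hypothetical. The value printed on the `gflops:` line of the logs (2147483648) equals 2 x 1024^3, which suggests that line shows the raw operation count rather than a rate.

```c
#include <assert.h>

/* Throughput in GFLOPS for an M x N x K matrix multiplication,
 * counting one multiply and one add per inner-loop iteration. */
double gflops(int M, int N, int K, double seconds) {
    assert(M > 0 && N > 0 && K > 0);
    assert(seconds > 0.0);
    double ops = 2.0 * (double)M * (double)N * (double)K;
    return ops / seconds / 1e9;
}
```

For example, `gflops(1024, 1024, 1024, 0.590778)` gives roughly 3.635, in line with the float row.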
float
Stable. Average time is 0.590778 seconds, corresponding to 3.635007 GFLOPS.
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 10 times CPU starting...
0 32.135118
>>> [INFO] skip first 1 time(s)
1 31.683712
2 31.796608
3 32.048721
4 31.663059
5 32.179186
6 31.730488
7 32.137520
8 31.663593
9 31.636981
10 31.600799
>>> [INFO] CPU 1024x1024x1024 31.814067 s 0.067501 GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 10 times ./kernel.cl.mat_mult_naive starting ...
0 0.578418
>>> [INFO] skip first 1 time(s)
1 0.574513
2 0.624397
3 0.575214
4 0.578982
5 0.575077
6 0.608436
7 0.575884
8 0.624576
9 0.594593
10 0.576112
gflops: 2147483648.000000
ave_duration: 0.590778
>>> [INFO] CL_GPU 1024x1024x1024 0.590778 s 3.635007 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
int
The benchmark result is not stable: across runs, naive matrix multiplication takes 0.58~0.77 seconds, with reported throughput between 2.9 and 3.5 GFLOPS. The run below averaged 0.587161 seconds (3.657402 GFLOPS).
$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_naive 5 1024 1024 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 5 times CPU starting...
0 49.456371
>>> [INFO] skip first 1 time(s)
1 49.445682
2 51.025633
3 49.424161
4 50.706564
5 50.874110
>>> [INFO] CPU 1024x1024x1024 50.295230 s 0.042698 GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 5 times ./kernel.cl.mat_mult_naive starting ...
0 0.666985
>>> [INFO] skip first 1 time(s)
1 0.628585
2 0.577195
3 0.577648
4 0.576896
5 0.575481
gflops: 2147483648.000000
ave_duration: 0.587161
>>> [INFO] CL_GPU 1024x1024x1024 0.587161 s 3.657402 GFLOPS
>>> [TEST] correct rate: 1.0000
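The `ave_duration` values in these logs match the mean of runs 1..n with run 0 skipped as a warm-up. A minimal sketch of that averaging, extended with min and max so that instability like the int case can be quantified, might look like this. The struct `run_stats` and the function `summarize_runs` are hypothetical names, not part of the benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of a series of timed runs, in seconds. */
typedef struct {
    double mean;
    double min;
    double max;
} run_stats;

/* Summarize timings while ignoring the first `skip` warm-up runs. */
run_stats summarize_runs(const double *seconds, size_t count, size_t skip) {
    assert(seconds != NULL && count > skip);
    run_stats s = { 0.0, seconds[skip], seconds[skip] };
    for (size_t i = skip; i < count; i++) {
        s.mean += seconds[i];
        if (seconds[i] < s.min) s.min = seconds[i];
        if (seconds[i] > s.max) s.max = seconds[i];
    }
    s.mean /= (double)(count - skip);
    return s;
}
```

Applied to the six int GPU timings above with `skip = 1`, the mean comes out at 0.587161 s, with a spread from 0.575481 s to 0.628585 s.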
half
Stable. About 0.505 seconds per run, corresponding to 4.21~4.24 GFLOPS; the run below averaged 0.505549 seconds (4.247821 GFLOPS).
$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_naive 10 1024 1024 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152
============== CPU RESULT ==============
>>> [INFO] 10 times CPU starting...
0 22.410335
>>> [INFO] skip first 1 time(s)
1 22.575216
2 22.340271
3 22.540361
4 22.486432
5 22.455014
6 22.280294
7 22.382801
8 22.492505
9 22.183580
10 22.428460
>>> [INFO] CPU 1024x1024x1024 22.416493 s 0.095799 GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 10 times ./kernel.cl.mat_mult_naive starting ...
0 0.511446
>>> [INFO] skip first 1 time(s)
1 0.506768
2 0.504923
3 0.505440
4 0.505335
5 0.505887
6 0.504761
7 0.505133
8 0.505742
9 0.506590
10 0.504916
gflops: 2147483648.000000
ave_duration: 0.505549
>>> [INFO] CL_GPU 1024x1024x1024 0.505549 s 4.247821 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
double
============== INIT =============
>>> [INFO] ELEM_TYPE_STR: double, sizeof(ELEM_TYPE): 8
>>> [INFO] CL_ELEM_TYPE_STR: double, sizeof(CL_ELEM_TYPE): 8
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 8388608, data_size_b: 8388608, data_size_c: 8388608
============== CPU RESULT ==============
>>> [INFO] 10 times CPU starting...
0 62.132808
>>> [INFO] skip first 1 time(s)
1 62.124912
2 62.114249
3 62.107852
4 62.124306
5 62.125422
6 62.110446
7 62.129060
8 62.106102
9 63.081955
10 65.068961
>>> [INFO] CPU 1024x1024x1024 62.509327 s 0.034355 GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_naive
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 10 times ./kernel.cl.mat_mult_naive starting ...
0 0.758725
>>> [INFO] skip first 1 time(s)
1 0.759127
2 0.752441
3 0.737415
4 0.738173
5 0.736296
6 0.790745
7 0.796274
8 0.735750
9 0.741224
10 0.754608
gflops: 2147483648.000000
ave_duration: 0.754205
>>> [INFO] CL_GPU 1024x1024x1024 0.754205 s 2.847346 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
float2
$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_vec2 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float2, sizeof(CL_ELEM_TYPE): 8
>>> [WARN] ELEM_TYPE(float) size differs from CL_ELEM_TYPE(float2)
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 32.072621
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_vec2
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./kernel.cl.mat_mult_vec2 starting ...
0 0.745514
>>> [INFO] skip first 1 time(s)
1 0.664003
2 0.682694
3 0.705091
4 0.664341
5 0.665731
6 0.665375
7 0.663946
8 0.665404
9 0.704201
10 0.678438
11 0.699856
12 0.740078
13 0.664164
14 0.681739
15 0.662163
16 0.664076
17 0.664025
18 0.669523
19 0.690585
20 0.665842
>>> [INFO] CL_GPU 1024x1024x1024 0.678064 s 3.167082 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
int2
$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_vec2 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int2, sizeof(CL_ELEM_TYPE): 8
>>> [WARN] ELEM_TYPE(int) size differs from CL_ELEM_TYPE(int2)
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 42.547083
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_vec2
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./kernel.cl.mat_mult_vec2 starting ...
0 0.729240
>>> [INFO] skip first 1 time(s)
1 0.663346
2 0.663381
3 0.665677
4 0.688709
5 0.667141
6 0.675088
7 0.668971
8 0.677944
9 0.677716
10 0.662854
11 0.683053
12 0.663949
13 0.707164
14 0.664728
15 0.667829
16 0.669991
17 0.661924
18 0.691719
19 0.660779
20 0.726561
>>> [INFO] CL_GPU 1024x1024x1024 0.675426 s 3.179450 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
half2
$ ./matrixMultiplication 1024 1024 1024 ./kernel.cl mat_mult_vec2 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half2, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 21.096589
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./kernel.cl, kernel_func: mat_mult_vec2
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./kernel.cl.mat_mult_vec2 starting ...
0 0.697708
>>> [INFO] skip first 1 time(s)
1 0.598590
2 0.598986
3 0.599458
4 0.598561
5 0.598886
6 0.600000
7 0.600150
8 0.603970
9 0.600956
10 0.601270
11 0.597891
12 0.598698
13 0.598759
14 0.599476
15 0.598256
16 0.599068
17 0.598711
18 0.599195
19 0.599339
20 0.599145
>>> [INFO] CL_GPU 1024x1024x1024 0.599468 s 3.582314 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
half4
$ ./matrixMultiplication 1024 1024 1024 ./vec4.cl mat_mult_vec4 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half4, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 24.388975
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec4.cl, kernel_func: mat_mult_vec4
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec4.cl.mat_mult_vec4 starting ...
0 0.785168
>>> [INFO] skip first 1 time(s)
1 0.785488
2 0.785155
3 0.784971
4 0.785315
5 0.784797
6 0.784791
7 0.785390
8 0.785067
9 0.784999
10 0.784745
11 0.786073
12 0.784571
13 0.786385
14 0.784829
15 0.785454
16 0.784674
17 0.784963
18 0.786262
19 0.784866
20 0.785740
>>> [INFO] CL_GPU 1024x1024x1024 0.785227 s 2.734858 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
int4
$ ./matrixMultiplication 1024 1024 1024 ./vec4.cl mat_mult_vec4 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int4, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 50.925559
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec4.cl, kernel_func: mat_mult_vec4
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec4.cl.mat_mult_vec4 starting ...
0 1.227366
>>> [INFO] skip first 1 time(s)
1 1.232017
2 1.230932
3 1.227194
4 1.231709
5 1.232883
6 1.229543
7 1.228786
8 1.229696
9 1.227157
10 1.227646
11 1.227292
12 1.227880
13 1.228065
14 1.231296
15 1.226779
16 1.227354
17 1.227589
18 1.227081
19 1.229803
20 1.231007
>>> [INFO] CL_GPU 1024x1024x1024 1.229085 s 1.747221 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
float4
$ ./matrixMultiplication 1024 1024 1024 ./vec4.cl mat_mult_vec4 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float4, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 33.002391
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec4.cl, kernel_func: mat_mult_vec4
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec4.cl.mat_mult_vec4 starting ...
0 1.252484
>>> [INFO] skip first 1 time(s)
1 1.250876
2 1.249836
3 1.249639
4 1.224961
5 1.238079
6 1.249810
7 1.250827
8 1.253317
9 1.250303
10 1.248669
11 1.250466
12 1.249963
13 1.249999
14 1.249263
15 1.248959
16 1.250244
17 1.249647
18 1.249784
19 1.249077
20 1.248922
>>> [INFO] CL_GPU 1024x1024x1024 1.248132 s 1.720558 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
int8
$ ./matrixMultiplication 1024 1024 1024 ./vec8.cl mat_mult_vec8 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int8, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 48.183118
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec8.cl, kernel_func: mat_mult_vec8
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec8.cl.mat_mult_vec8 starting ...
0 1.245643
>>> [INFO] skip first 1 time(s)
1 1.247249
2 1.246787
3 1.247569
4 1.247150
5 1.246996
6 1.248807
7 1.246896
8 1.247675
9 1.248380
10 1.245348
11 1.249068
12 1.248908
13 1.248970
14 1.247565
15 1.251570
16 1.248810
17 1.250007
18 1.247615
19 1.248424
20 1.248822
>>> [INFO] CL_GPU 1024x1024x1024 1.248131 s 1.720560 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
half8
$ ./matrixMultiplication 1024 1024 1024 ./vec8.cl mat_mult_vec8 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half8, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 21.785163
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec8.cl, kernel_func: mat_mult_vec8
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec8.cl.mat_mult_vec8 starting ...
0 1.064921
>>> [INFO] skip first 1 time(s)
1 1.060214
2 1.063404
3 1.066602
4 1.067873
5 1.063292
6 1.058918
7 1.064117
8 1.061240
9 1.066587
10 1.062076
11 1.065420
12 1.063350
13 1.059808
14 1.066744
15 1.065193
16 1.062735
17 1.067862
18 1.066636
19 1.063445
20 1.067638
>>> [INFO] CL_GPU 1024x1024x1024 1.064158 s 2.018013 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
float8
$ ./matrixMultiplication 1024 1024 1024 ./vec8.cl mat_mult_vec8 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float8, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 32.317702
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec8.cl, kernel_func: mat_mult_vec8
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec8.cl.mat_mult_vec8 starting ...
0 1.227704
>>> [INFO] skip first 1 time(s)
1 1.225342
2 1.226772
3 1.223779
4 1.225490
5 1.226510
6 1.226443
7 1.227413
8 1.225230
9 1.222004
10 1.226666
11 1.225248
12 1.224709
13 1.223577
14 1.224604
15 1.225972
16 1.225790
17 1.225214
18 1.224669
19 1.225131
20 1.225476
>>> [INFO] CL_GPU 1024x1024x1024 1.225302 s 1.752616 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
float16
$ ./matrixMultiplication 1024 1024 1024 ./vec16.cl mat_mult_vec16 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: float, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: float16, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 32.313083
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec16.cl, kernel_func: mat_mult_vec16
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec16.cl.mat_mult_vec16 starting ...
0 1.584943
>>> [INFO] skip first 1 time(s)
1 1.570398
2 1.582975
3 1.572269
4 1.585434
5 1.571658
6 1.585436
7 1.572811
8 1.583057
9 1.572311
10 1.585094
11 1.573029
12 1.588597
13 1.572048
14 1.585594
15 1.571509
16 1.588572
17 1.571195
18 1.588734
19 1.578679
20 1.586458
>>> [INFO] CL_GPU 1024x1024x1024 1.579293 s 1.359775 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
half16
$ ./matrixMultiplication 1024 1024 1024 ./vec16.cl mat_mult_vec16 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: __fp16, sizeof(ELEM_TYPE): 2
>>> [INFO] CL_ELEM_TYPE_STR: half16, sizeof(CL_ELEM_TYPE): 2
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 2097152, data_size_b: 2097152, data_size_c: 2097152
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 21.933312
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec16.cl, kernel_func: mat_mult_vec16
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec16.cl.mat_mult_vec16 starting ...
0 1.113329
>>> [INFO] skip first 1 time(s)
1 1.109171
2 1.112040
3 1.118125
4 1.108358
5 1.112967
6 1.115943
7 1.109845
8 1.109008
9 1.110358
10 1.110518
11 1.105643
12 1.113373
13 1.110546
14 1.109875
15 1.106689
16 1.116894
17 1.114391
18 1.115844
19 1.108951
20 1.108160
>>> [INFO] CL_GPU 1024x1024x1024 1.111335 s 1.932346 GFLOPS
int16
$ ./matrixMultiplication 1024 1024 1024 ./vec16.cl mat_mult_vec16 0 20 $[1024] $[1024] 1
============== INIT ==============
>>> [INFO] ELEM_TYPE_STR: int, sizeof(ELEM_TYPE): 4
>>> [INFO] CL_ELEM_TYPE_STR: int16, sizeof(CL_ELEM_TYPE): 4
>>> [INFO] len_a: 1048576, len_b: 1048576, len_c: 1048576
>>> [INFO] data_size_a: 4194304, data_size_b: 4194304, data_size_c: 4194304
============== CPU RESULT ==============
>>> [INFO] 0 times CPU starting...
0 44.288010
>>> [INFO] skip first 1 time(s)
>>> [INFO] CPU 1024x1024x1024 nan s nan GFLOPS
============== GPU RESULT ==============
>>> [INFO] Device name: Mali-T86x MP4 r2p0 0x0860
>>> [INFO] program_file: ./vec16.cl, kernel_func: mat_mult_vec16
>>> [INFO] global_work_size[3]: { 1024, 1024, 1 }
>>> [INFO] CL_GPU 20 times ./vec16.cl.mat_mult_vec16 starting ...
0 1.560015
>>> [INFO] skip first 1 time(s)
1 1.549140
2 1.556859
3 1.546036
4 1.557132
5 1.550042
6 1.560190
7 1.551438
8 1.563459
9 1.548408
10 1.557484
11 1.551499
12 1.558829
13 1.551887
14 1.558873
15 1.549608
16 1.557619
17 1.553162
18 1.558806
19 1.553744
20 1.558716
>>> [INFO] CL_GPU 1024x1024x1024 1.554647 s 1.381332 GFLOPS
>>> [TEST] correct rate: 1.0000
>>> [TEST] ~ Bingo ~ matrix a == matrix b
Summary for 1024 x 1024 x 1024 matrix multiplication on the Mali-T860, with the task size equal to the global work size (average execution time in seconds):

| type | scalar | vec2 | vec4 | vec8 | vec16 |
|---|---|---|---|---|---|
| int | 0.587161 | 0.675426 | 1.229085 | 1.248131 | 1.554647 |
| float | 0.590778 | 0.678064 | 1.248132 | 1.225302 | 1.579293 |
| half | 0.505549 | 0.599468 | 0.785227 | 1.064158 | 1.111335 |
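In this summary the scalar kernel is the fastest variant for every type. To make the comparison concrete, the following sketch picks the fastest entry for one type and expresses any other entry as a slowdown relative to the scalar time. Both function names are hypothetical, and the timings are assumed to be ordered scalar, vec2, vec4, vec8, vec16.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the smallest time in a series (0 = scalar, 1 = vec2, ...). */
size_t fastest_index(const double *seconds, size_t count) {
    assert(seconds != NULL && count > 0);
    size_t best = 0;
    for (size_t i = 1; i < count; i++) {
        if (seconds[i] < seconds[best]) best = i;
    }
    return best;
}

/* Ratio of a vector kernel's time to the scalar kernel's time
 * (values above 1 mean the vector kernel is slower). */
double slowdown(double scalar_seconds, double vector_seconds) {
    assert(scalar_seconds > 0.0 && vector_seconds > 0.0);
    return vector_seconds / scalar_seconds;
}
```

With the float timings, `slowdown(0.590778, 1.579293)` is about 2.67, so the float16 kernel takes roughly 2.67 times as long as the scalar float kernel at this problem size.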