umar456 / arrayfire-benchmarks

raw addition vs tiled addition

WilliamTambellini opened this issue

Hi @umar456
When doing a tiled addition
C = A + tile(B, 1, A.dims(1), A.dims(2)); // B being a 1d vector
versus a raw, non-tiled addition
C = A + B; // A and B having the same shape
I would expect afcuda to use less memory in the tiled case, but it looks like it does not.

Would you know why?

batched_add.cpp.txt
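
For context, a minimal standalone sketch of the two variants being compared (plain ArrayFire C++ API; the sizes are examples only and this is not the attached benchmark):

#include <arrayfire.h>
#include <cstdio>

int main() {
    // Example sizes only; the attached benchmark sweeps N and batch.
    const dim_t N = 1024, batch = 8;

    af::array A     = af::randu(N, N, batch, f32);
    af::array Bfull = af::randu(N, N, batch, f32); // same shape as A
    af::array Bvec  = af::randu(N, f32);           // 1d vector to be tiled

    // Raw, non-tiled addition: A and B have the same shape.
    af::array C1 = A + Bfull;
    C1.eval();

    // Tiled addition: the 1d vector is broadcast over dims 1 and 2.
    af::array C2 = A + af::tile(Bvec, 1, (unsigned)A.dims(1), (unsigned)A.dims(2));
    C2.eval();
    af::sync();

    // Ask the ArrayFire memory manager how much device memory it has allocated.
    size_t allocBytes, allocBuffers, lockBytes, lockBuffers;
    af::deviceMemInfo(&allocBytes, &allocBuffers, &lockBytes, &lockBuffers);
    std::printf("allocated: %zu bytes in %zu buffers\n", allocBytes, allocBuffers);
    return 0;
}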

Thanks for the benchmark! Looking into it now.

The allocations you are seeing are from the C array, which is allocated in the benchmark. This array is the same size in both cases. Also, you are calling the addBase function in both cases. If you fix the tiled benchmark declaration you should see the expected behaviour.
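
In other words, both entries were registered against the same function. A hypothetical sketch of what the fixed declarations could look like (the names addBase/addTiled, the Args values, and the setup code are illustrative, not the contents of the attached file):

#include <arrayfire.h>
#include <benchmark/benchmark.h>

// Raw addition: A and B both of shape [N, N, batch].
static void addBase(benchmark::State& state) {
    const dim_t N = state.range(0), batch = state.range(1);
    af::array A = af::randu(N, N, batch, f32);
    af::array B = af::randu(N, N, batch, f32);
    for (auto _ : state) {
        af::array C = A + B;   // C is allocated here and is the same size in both benchmarks
        C.eval();
        af::sync();
    }
}

// Tiled addition: B is a 1d vector broadcast over dims 1 and 2.
static void addTiled(benchmark::State& state) {
    const dim_t N = state.range(0), batch = state.range(1);
    af::array A = af::randu(N, N, batch, f32);
    af::array B = af::randu(N, f32);
    for (auto _ : state) {
        af::array C = A + af::tile(B, 1, (unsigned)N, (unsigned)batch);
        C.eval();
        af::sync();
    }
}

// The original tiled declaration still pointed at addBase, so both entries
// measured the same code path. Each entry needs its own function:
BENCHMARK(addBase)->Args({512, 2});
BENCHMARK(addTiled)->Args({512, 2});
BENCHMARK_MAIN();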

Please feel free to create a PR to add this benchmark. I can also do it if you don't want to.

tile.txt

Cool, thanks.

$ ./batched_add
2018-11-28 12:12:31
Running ./batched_add
Run on (8 X 3500 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 256K (x4)
L3 Unified 6144K (x1)
WARNING CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
WARNING Library was built as DEBUG. Timings may be affected.
Context:
ArrayFire v3.6.2 (dc38ef1) (Backend: CUDA Platform: CUDA)
Device: GeForce_GTX_1060 (6.1 v9.2)

-----------------------------------------------------------------------------------------------------
Benchmark                                              Time           CPU Iterations alloc_megabytes
-----------------------------------------------------------------------------------------------------
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:2           53 us         53 us      12797               0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:2         176 us        176 us       3915               0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:2         665 us        665 us        987               0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:4           94 us         94 us       7132               0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:4         338 us        338 us       2079               0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:4        1312 us       1312 us        519               0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:8          175 us        175 us       3914               0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:8         666 us        666 us        967               0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:8        2617 us       2617 us        269               0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:16         375 us        375 us       2063               0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:16       1313 us       1313 us        515               0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:16       5217 us       5217 us        124               0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:32         663 us        663 us        994               0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:32       2631 us       2631 us        269               0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:32      10413 us      10413 us         68               0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:64        1311 us       1311 us        528               0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:64       5218 us       5218 us        124               0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:64      20846 us      20846 us         33               0
A[N,N,batch]+tile(B)/f32/N:512/batch:2                86 us         86 us       6000               0
A[N,N,batch]+tile(B)/f32/N:1024/batch:2              302 us        302 us       2328               0
A[N,N,batch]+tile(B)/f32/N:2048/batch:2             1180 us       1180 us        577               0
A[N,N,batch]+tile(B)/f32/N:512/batch:4               158 us        158 us       4323               0
A[N,N,batch]+tile(B)/f32/N:1024/batch:4              594 us        594 us       1098               0
A[N,N,batch]+tile(B)/f32/N:2048/batch:4             2339 us       2339 us        295               0
A[N,N,batch]+tile(B)/f32/N:512/batch:8               301 us        301 us       2304               0
A[N,N,batch]+tile(B)/f32/N:1024/batch:8             1183 us       1183 us        573               0
A[N,N,batch]+tile(B)/f32/N:2048/batch:8             4685 us       4685 us        134               0
A[N,N,batch]+tile(B)/f32/N:512/batch:16              598 us        598 us       1073               0
A[N,N,batch]+tile(B)/f32/N:1024/batch:16            2347 us       2347 us        298               0
A[N,N,batch]+tile(B)/f32/N:2048/batch:16            9391 us       9391 us         71               0
A[N,N,batch]+tile(B)/f32/N:512/batch:32             1182 us       1182 us        575               0
A[N,N,batch]+tile(B)/f32/N:1024/batch:32            4693 us       4693 us        135               0
A[N,N,batch]+tile(B)/f32/N:2048/batch:32           19019 us      19019 us         36               0
A[N,N,batch]+tile(B)/f32/N:512/batch:64             2366 us       2366 us        298               0
A[N,N,batch]+tile(B)/f32/N:1024/batch:64            9403 us       9403 us         71               0
A[N,N,batch]+tile(B)/f32/N:2048/batch:64           38368 us      38369 us         18               0

So the tiled add seems about twice as slow as the non-tiled one: is that expected?

Note: you can merge that little bench into this repo if you'd like to.

This is what I am getting on the GV100:

Context:
ArrayFire v3.7.0 (b8daf96a) (Backend: CUDA Platform: CUDA)
Device: Quadro_GV100 (7.0 v10)
-----------------------------------------------------------------------------------------------------
Benchmark                                              Time           CPU Iterations alloc_megabytes
-----------------------------------------------------------------------------------------------------
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:2           26 us         26 us      25521               0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:2          64 us         64 us      10930               0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:2         190 us        190 us       3720               0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:4           42 us         42 us      16703               0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:4         108 us        108 us       6599               0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:4         356 us        355 us       1950               0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:8           65 us         65 us      10801               0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:8         190 us        190 us       3701               0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:8         693 us        691 us       1013               0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:16         107 us        107 us       6535               0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:16        356 us        355 us       1956               0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:16       1355 us       1353 us        514               0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:32         191 us        191 us       3702               0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:32        693 us        692 us       1008               0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:32       2701 us       2697 us        258               0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:64         369 us        368 us       1907               0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:64       1369 us       1366 us        510               0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:64       5385 us       5377 us        119               0
A[N,N,batch]+tile(B)/f32/N:512/batch:2                32 us         32 us      22167               0
A[N,N,batch]+tile(B)/f32/N:1024/batch:2               58 us         58 us      11874               0
A[N,N,batch]+tile(B)/f32/N:2048/batch:2              150 us        150 us       4701               0
A[N,N,batch]+tile(B)/f32/N:512/batch:4                43 us         43 us      16509               0
A[N,N,batch]+tile(B)/f32/N:1024/batch:4               88 us         88 us       7948               0
A[N,N,batch]+tile(B)/f32/N:2048/batch:4              285 us        285 us       2488               0
A[N,N,batch]+tile(B)/f32/N:512/batch:8                59 us         59 us      11808               0
A[N,N,batch]+tile(B)/f32/N:1024/batch:8              153 us        153 us       4574               0
A[N,N,batch]+tile(B)/f32/N:2048/batch:8              555 us        554 us       1261               0
A[N,N,batch]+tile(B)/f32/N:512/batch:16               88 us         88 us       7908               0
A[N,N,batch]+tile(B)/f32/N:1024/batch:16             289 us        288 us       2400               0
A[N,N,batch]+tile(B)/f32/N:2048/batch:16            1061 us       1060 us        651               0
A[N,N,batch]+tile(B)/f32/N:512/batch:32              150 us        150 us       4620               0
A[N,N,batch]+tile(B)/f32/N:1024/batch:32             548 us        547 us       1228               0
A[N,N,batch]+tile(B)/f32/N:2048/batch:32            2125 us       2122 us        328               0
A[N,N,batch]+tile(B)/f32/N:512/batch:64              282 us        281 us       2492               0
A[N,N,batch]+tile(B)/f32/N:1024/batch:64            1084 us       1083 us        658               0
A[N,N,batch]+tile(B)/f32/N:2048/batch:64            4204 us       4198 us        167               0

Tiles seem to be a bit faster.

OK, so I've forced a release build and it looks like the tiled add is still slower than the raw add on the GTX 1060, but not on the GV100. Probably related to index types: int vs long long. TBC.
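
For reference, a quick way to cross-check the gap outside Google Benchmark is a plain af::timer loop (a sketch only; the N/batch values and iteration count are arbitrary):

#include <arrayfire.h>
#include <cstdio>

int main() {
    const dim_t N = 2048, batch = 64;
    const int iters = 100;

    af::array A     = af::randu(N, N, batch, f32);
    af::array Bfull = af::randu(N, N, batch, f32);
    af::array Bvec  = af::randu(N, f32);

    // Warm up both expressions so JIT compilation is not part of the timings.
    { af::array C = A + Bfull; C.eval(); }
    { af::array C = A + af::tile(Bvec, 1, (unsigned)N, (unsigned)batch); C.eval(); }
    af::sync();

    af::timer t = af::timer::start();
    for (int i = 0; i < iters; ++i) { af::array C = A + Bfull; C.eval(); }
    af::sync();
    std::printf("raw add:   %.3f ms/iter\n", 1e3 * af::timer::stop(t) / iters);

    t = af::timer::start();
    for (int i = 0; i < iters; ++i) {
        af::array C = A + af::tile(Bvec, 1, (unsigned)N, (unsigned)batch);
        C.eval();
    }
    af::sync();
    std::printf("tiled add: %.3f ms/iter\n", 1e3 * af::timer::stop(t) / iters);
    return 0;
}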