raw addition vs tiled addition
WilliamTambellini opened this issue · comments
Hi @umar456,
When doing a tiled addition:
C = A + tile(B, 1, A.dims(1), A.dims(2)); // B being a 1-D vector
versus a plain, non-tiled addition:
C = A + B; // A and B being the same shape
I would expect afcuda to use less memory, but it looks like it does not.
Would you know why?
Thanks for the benchmark! Looking into it now.
The allocations you are seeing are from the C array, which is allocated in the benchmark. This array is the same size in both cases. Also, you are calling the addBase function in both cases. If you fix the tiled benchmark declaration, you should see the expected behaviour.
Please feel free to create a PR to add this benchmark. I can also do it if you don't want to.
Cool, thanks.
$ ./batched_add
2018-11-28 12:12:31
Running ./batched_add
Run on (8 X 3500 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 256K (x4)
L3 Unified 6144K (x1)
WARNING CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
WARNING Library was built as DEBUG. Timings may be affected.
Context:
ArrayFire v3.6.2 (dc38ef1) (Backend: CUDA Platform: CUDA)
Device: GeForce_GTX_1060 (6.1 v9.2)
-----------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations alloc_megabytes
-----------------------------------------------------------------------------------------------------
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:2 53 us 53 us 12797 0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:2 176 us 176 us 3915 0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:2 665 us 665 us 987 0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:4 94 us 94 us 7132 0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:4 338 us 338 us 2079 0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:4 1312 us 1312 us 519 0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:8 175 us 175 us 3914 0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:8 666 us 666 us 967 0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:8 2617 us 2617 us 269 0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:16 375 us 375 us 2063 0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:16 1313 us 1313 us 515 0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:16 5217 us 5217 us 124 0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:32 663 us 663 us 994 0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:32 2631 us 2631 us 269 0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:32 10413 us 10413 us 68 0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:64 1311 us 1311 us 528 0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:64 5218 us 5218 us 124 0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:64 20846 us 20846 us 33 0
A[N,N,batch]+tile(B)/f32/N:512/batch:2 86 us 86 us 6000 0
A[N,N,batch]+tile(B)/f32/N:1024/batch:2 302 us 302 us 2328 0
A[N,N,batch]+tile(B)/f32/N:2048/batch:2 1180 us 1180 us 577 0
A[N,N,batch]+tile(B)/f32/N:512/batch:4 158 us 158 us 4323 0
A[N,N,batch]+tile(B)/f32/N:1024/batch:4 594 us 594 us 1098 0
A[N,N,batch]+tile(B)/f32/N:2048/batch:4 2339 us 2339 us 295 0
A[N,N,batch]+tile(B)/f32/N:512/batch:8 301 us 301 us 2304 0
A[N,N,batch]+tile(B)/f32/N:1024/batch:8 1183 us 1183 us 573 0
A[N,N,batch]+tile(B)/f32/N:2048/batch:8 4685 us 4685 us 134 0
A[N,N,batch]+tile(B)/f32/N:512/batch:16 598 us 598 us 1073 0
A[N,N,batch]+tile(B)/f32/N:1024/batch:16 2347 us 2347 us 298 0
A[N,N,batch]+tile(B)/f32/N:2048/batch:16 9391 us 9391 us 71 0
A[N,N,batch]+tile(B)/f32/N:512/batch:32 1182 us 1182 us 575 0
A[N,N,batch]+tile(B)/f32/N:1024/batch:32 4693 us 4693 us 135 0
A[N,N,batch]+tile(B)/f32/N:2048/batch:32 19019 us 19019 us 36 0
A[N,N,batch]+tile(B)/f32/N:512/batch:64 2366 us 2366 us 298 0
A[N,N,batch]+tile(B)/f32/N:1024/batch:64 9403 us 9403 us 71 0
A[N,N,batch]+tile(B)/f32/N:2048/batch:64 38368 us 38369 us 18 0
So tiled add seems about twice as slow as non-tiled: is that expected?
Note: you can merge that little bench into this repo if you'd like to.
This is what I am getting on the GV100:
Context:
ArrayFire v3.7.0 (b8daf96a) (Backend: CUDA Platform: CUDA)
Device: Quadro_GV100 (7.0 v10)
-----------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations alloc_megabytes
-----------------------------------------------------------------------------------------------------
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:2 26 us 26 us 25521 0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:2 64 us 64 us 10930 0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:2 190 us 190 us 3720 0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:4 42 us 42 us 16703 0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:4 108 us 108 us 6599 0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:4 356 us 355 us 1950 0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:8 65 us 65 us 10801 0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:8 190 us 190 us 3701 0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:8 693 us 691 us 1013 0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:16 107 us 107 us 6535 0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:16 356 us 355 us 1956 0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:16 1355 us 1353 us 514 0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:32 191 us 191 us 3702 0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:32 693 us 692 us 1008 0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:32 2701 us 2697 us 258 0
A[N,N,batch]+B[N,N,batch]/f32/N:512/batch:64 369 us 368 us 1907 0
A[N,N,batch]+B[N,N,batch]/f32/N:1024/batch:64 1369 us 1366 us 510 0
A[N,N,batch]+B[N,N,batch]/f32/N:2048/batch:64 5385 us 5377 us 119 0
A[N,N,batch]+tile(B)/f32/N:512/batch:2 32 us 32 us 22167 0
A[N,N,batch]+tile(B)/f32/N:1024/batch:2 58 us 58 us 11874 0
A[N,N,batch]+tile(B)/f32/N:2048/batch:2 150 us 150 us 4701 0
A[N,N,batch]+tile(B)/f32/N:512/batch:4 43 us 43 us 16509 0
A[N,N,batch]+tile(B)/f32/N:1024/batch:4 88 us 88 us 7948 0
A[N,N,batch]+tile(B)/f32/N:2048/batch:4 285 us 285 us 2488 0
A[N,N,batch]+tile(B)/f32/N:512/batch:8 59 us 59 us 11808 0
A[N,N,batch]+tile(B)/f32/N:1024/batch:8 153 us 153 us 4574 0
A[N,N,batch]+tile(B)/f32/N:2048/batch:8 555 us 554 us 1261 0
A[N,N,batch]+tile(B)/f32/N:512/batch:16 88 us 88 us 7908 0
A[N,N,batch]+tile(B)/f32/N:1024/batch:16 289 us 288 us 2400 0
A[N,N,batch]+tile(B)/f32/N:2048/batch:16 1061 us 1060 us 651 0
A[N,N,batch]+tile(B)/f32/N:512/batch:32 150 us 150 us 4620 0
A[N,N,batch]+tile(B)/f32/N:1024/batch:32 548 us 547 us 1228 0
A[N,N,batch]+tile(B)/f32/N:2048/batch:32 2125 us 2122 us 328 0
A[N,N,batch]+tile(B)/f32/N:512/batch:64 282 us 281 us 2492 0
A[N,N,batch]+tile(B)/f32/N:1024/batch:64 1084 us 1083 us 658 0
A[N,N,batch]+tile(B)/f32/N:2048/batch:64 4204 us 4198 us 167 0
Tiles seem to be a bit faster.
OK, so I've forced a release build, and it looks like tiled add is still slower than raw add on the GTX 1060 but not on the GV100. Probably related to index types: int vs. long long. TBC.