YingboMa / FastBroadcast.jl

Speedup _observed_ with dynamic broadcasting

navidcy opened this issue · comments

Despite the claims in the README, I actually get:

julia> b = [1.0];

julia> @btime foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  45.125 μs (0 allocations: 0 bytes)

julia> @btime fast_foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  18.375 μs (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.6.0)
  CPU: 10 × Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 6 on 8 virtual cores
Environment:
  JULIA_EDITOR = code

What's the problem? Your `fast_foo9` is over 2x faster.

EDIT: oh, even when broadcasting b. Huh.

FWIW, I got

julia> using FastBroadcast

julia> function fast_foo9(a, b, c, d, e, f, g, h, i)
           @.. a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
fast_foo9 (generic function with 1 method)

julia> function foo9(a, b, c, d, e, f, g, h, i)
           @. a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
           nothing
       end
foo9 (generic function with 1 method)

julia> a, b, c, d, e, f, g, h, i = [rand(100, 100, 2) for i in 1:9];

julia> using BenchmarkTools

julia> @btime fast_foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  38.674 μs (0 allocations: 0 bytes)

julia> @btime foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  83.503 μs (0 allocations: 0 bytes)

julia> b = [1.0];

julia> @btime foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  85.732 μs (0 allocations: 0 bytes)

julia> @btime fast_foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
  30.452 μs (0 allocations: 0 bytes)

So I can reproduce.

Comparing 30,000 evaluations, where `b` is the full-size array and `bs` is the small (length-1) version:
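(`foreachf` and `bs` are not defined anywhere in the thread; a minimal sketch of the presumed setup, assuming `foreachf` simply calls the function `n` times so hardware counters are sampled over many evaluations:)

```julia
# Presumed setup: the full-size `b` restored and the length-1 `bs`.
b  = rand(100, 100, 2);
bs = [1.0];

# Hypothetical helper matching the calls below: invoke f(args...) n times.
function foreachf(f::F, n::Int, args::Vararg{Any,A}) where {F,A}
    for _ in 1:n
        f(args...)
    end
end
```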

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(fast_foo9, 30_000, a, bs, c, d, e, f, g, h, i)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.60e+09   49.9%  #  3.6 cycles per ns
┌ instructions             2.96e+09   50.0%  #  0.8 insns per cycle
│ branch-instructions      2.27e+08   50.0%  #  7.7% of insns
└ branch-misses            1.85e+06   50.0%  #  0.8% of branch insns
┌ task-clock               1.01e+09  100.0%  #  1.0 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              4.00e+00  100.0%
┌ L1-dcache-load-misses    6.09e+08   25.0%  # 48.6% of dcache loads
│ L1-dcache-loads          1.25e+09   25.0%
└ L1-icache-load-misses    8.45e+06   25.0%
┌ dTLB-load-misses         1.23e+05   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.25e+09   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(fast_foo9, 30_000, a, b, c, d, e, f, g, h, i)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.90e+09   49.9%  #  3.6 cycles per ns
┌ instructions             1.43e+09   50.0%  #  0.4 insns per cycle
│ branch-instructions      7.52e+07   50.0%  #  5.3% of insns
└ branch-misses            3.01e+04   50.0%  #  0.0% of branch insns
┌ task-clock               1.09e+09  100.0%  #  1.1 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    6.76e+08   25.0%  # 112.4% of dcache loads
│ L1-dcache-loads          6.02e+08   25.0%
└ L1-icache-load-misses    1.71e+04   25.0%
┌ dTLB-load-misses         4.01e+00   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               6.02e+08   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(foo9, 30_000, a, b, c, d, e, f, g, h, i)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               9.86e+09   50.0%  #  3.8 cycles per ns
┌ instructions             3.07e+10   50.0%  #  3.1 insns per cycle
│ branch-instructions      6.37e+08   50.0%  #  2.1% of insns
└ branch-misses            6.58e+06   50.0%  #  1.0% of branch insns
┌ task-clock               2.59e+09  100.0%  #  2.6 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              4.80e+01  100.0%
┌ L1-dcache-load-misses    6.80e+08   25.0%  #  5.5% of dcache loads
│ L1-dcache-loads          1.24e+10   25.0%
└ L1-icache-load-misses    1.13e+06   25.0%
┌ dTLB-load-misses         7.47e+03   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.24e+10   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(foo9, 30_000, a, bs, c, d, e, f, g, h, i)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.12e+10   50.0%  #  3.8 cycles per ns
┌ instructions             3.18e+10   50.0%  #  2.8 insns per cycle
│ branch-instructions      8.62e+08   50.0%  #  2.7% of insns
└ branch-misses            9.41e+06   50.0%  #  1.1% of branch insns
┌ task-clock               2.93e+09  100.0%  #  2.9 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.26e+03  100.0%
┌ L1-dcache-load-misses    6.16e+08   25.0%  #  4.3% of dcache loads
│ L1-dcache-loads          1.45e+10   25.0%
└ L1-icache-load-misses    1.59e+07   25.0%
┌ dTLB-load-misses         2.51e+05   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               1.45e+10   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

So `fast_foo9` needs about twice as many instructions for the small `b` (2.96e9 vs 1.43e9), but performance is dominated by memory bandwidth, so it doesn't really matter.
Regular `foo9`, meanwhile, requires 10-20x as many instructions (over 3e10 in both cases) for some reason.

Should we fix this inaccuracy by inserting a sleep call in the dynamic broadcasting branch?

> Should we fix this inaccuracy by inserting a sleep call in the dynamic broadcasting branch?

Probably better to update the README instead: it currently claims FastBroadcast is slower than base broadcasting for dynamic broadcasts.
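For context, the dispatch in question works roughly like this: `@..` checks at runtime whether every argument spans the destination's full axes; if so it runs a simple fused loop, and otherwise it falls back to dynamic (Base-style) broadcasting. An illustrative sketch, not the package's actual implementation:

```julia
# Illustrative sketch of the fast-path-vs-dynamic dispatch
# (hypothetical function name; not FastBroadcast.jl's real code).
function fused_add!(a, b, c)
    if axes(b) == axes(a) && axes(c) == axes(a)
        # Fast path: all shapes match, so a plain linear loop suffices.
        @inbounds @simd for i in eachindex(a, b, c)
            a[i] = b[i] + c[i]
        end
    else
        # Dynamic path: fall back to Base broadcasting machinery,
        # which handles axis extrusion (e.g. a length-1 `b`).
        broadcast!(+, a, b, c)
    end
    return nothing
end
```

As the numbers above show, even the dynamic fallback here can beat `@.` in practice, since the memory traffic dominates either way.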