Speedup _observed_ with dynamic broadcasting
navidcy opened this issue
Despite the claims in the README, I actually get:
julia> b = [1.0];
julia> @btime foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
45.125 μs (0 allocations: 0 bytes)
julia> @btime fast_foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
18.375 μs (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 1.8.3
Commit 0434deb161 (2022-11-14 20:14 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.6.0)
CPU: 10 × Apple M1 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 6 on 8 virtual cores
Environment:
JULIA_EDITOR = code
What's the problem? Your `fast_foo9` is over 2x faster.

EDIT: oh, even when broadcasting `b`. Huh.
FWIW, I got
julia> using FastBroadcast
julia> function fast_foo9(a, b, c, d, e, f, g, h, i)
@.. a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
nothing
end
fast_foo9 (generic function with 1 method)
julia> function foo9(a, b, c, d, e, f, g, h, i)
@. a = b + 0.1 * (0.2c + 0.3d + 0.4e + 0.5f + 0.6g + 0.6h + 0.6i)
nothing
end
foo9 (generic function with 1 method)
julia> a, b, c, d, e, f, g, h, i = [rand(100, 100, 2) for i in 1:9];
julia> using BenchmarkTools
julia> @btime fast_foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
38.674 μs (0 allocations: 0 bytes)
julia> @btime foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
83.503 μs (0 allocations: 0 bytes)
julia> b = [1.0];
julia> @btime foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
85.732 μs (0 allocations: 0 bytes)
julia> @btime fast_foo9($a, $b, $c, $d, $e, $f, $g, $h, $i);
30.452 μs (0 allocations: 0 bytes)
So I can reproduce.
Comparing 30k evaluations, where `b` is full-size and `bs` is the small version:
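Two pieces of the session aren't shown in the transcript: the `foreachf` driver used in the `@pstats` runs, and `bs` (note that `b` was overwritten with `[1.0]` earlier, so both must have been rebound). Presumably they were set up along these lines; the exact definitions are an assumption:

```julia
# Assumed definition of the repeated-evaluation driver: call f(args...)
# N times so the perf counters aggregate over many evaluations.
function foreachf(f::F, N::Int, args::Vararg{Any,A}) where {F,A}
    for _ in 1:N
        f(args...)
    end
    nothing
end

b  = rand(100, 100, 2)  # full-size argument, same shape as `a`
bs = [1.0]              # length-1 argument, triggers dynamic broadcasting
```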
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
foreachf(fast_foo9, 30_000, a, bs, c, d, e, f, g, h, i)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 3.60e+09 49.9% # 3.6 cycles per ns
┌ instructions 2.96e+09 50.0% # 0.8 insns per cycle
│ branch-instructions 2.27e+08 50.0% # 7.7% of insns
└ branch-misses 1.85e+06 50.0% # 0.8% of branch insns
┌ task-clock 1.01e+09 100.0% # 1.0 s
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 4.00e+00 100.0%
┌ L1-dcache-load-misses 6.09e+08 25.0% # 48.6% of dcache loads
│ L1-dcache-loads 1.25e+09 25.0%
└ L1-icache-load-misses 8.45e+06 25.0%
┌ dTLB-load-misses 1.23e+05 25.0% # 0.0% of dTLB loads
└ dTLB-loads 1.25e+09 25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
foreachf(fast_foo9, 30_000, a, b, c, d, e, f, g, h, i)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 3.90e+09 49.9% # 3.6 cycles per ns
┌ instructions 1.43e+09 50.0% # 0.4 insns per cycle
│ branch-instructions 7.52e+07 50.0% # 5.3% of insns
└ branch-misses 3.01e+04 50.0% # 0.0% of branch insns
┌ task-clock 1.09e+09 100.0% # 1.1 s
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 0.00e+00 100.0%
┌ L1-dcache-load-misses 6.76e+08 25.0% # 112.4% of dcache loads
│ L1-dcache-loads 6.02e+08 25.0%
└ L1-icache-load-misses 1.71e+04 25.0%
┌ dTLB-load-misses 4.01e+00 25.0% # 0.0% of dTLB loads
└ dTLB-loads 6.02e+08 25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
foreachf(foo9, 30_000, a, b, c, d, e, f, g, h, i)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 9.86e+09 50.0% # 3.8 cycles per ns
┌ instructions 3.07e+10 50.0% # 3.1 insns per cycle
│ branch-instructions 6.37e+08 50.0% # 2.1% of insns
└ branch-misses 6.58e+06 50.0% # 1.0% of branch insns
┌ task-clock 2.59e+09 100.0% # 2.6 s
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 4.80e+01 100.0%
┌ L1-dcache-load-misses 6.80e+08 25.0% # 5.5% of dcache loads
│ L1-dcache-loads 1.24e+10 25.0%
└ L1-icache-load-misses 1.13e+06 25.0%
┌ dTLB-load-misses 7.47e+03 25.0% # 0.0% of dTLB loads
└ dTLB-loads 1.24e+10 25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
foreachf(foo9, 30_000, a, bs, c, d, e, f, g, h, i)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles 1.12e+10 50.0% # 3.8 cycles per ns
┌ instructions 3.18e+10 50.0% # 2.8 insns per cycle
│ branch-instructions 8.62e+08 50.0% # 2.7% of insns
└ branch-misses 9.41e+06 50.0% # 1.1% of branch insns
┌ task-clock 2.93e+09 100.0% # 2.9 s
│ context-switches 0.00e+00 100.0%
│ cpu-migrations 0.00e+00 100.0%
└ page-faults 1.26e+03 100.0%
┌ L1-dcache-load-misses 6.16e+08 25.0% # 4.3% of dcache loads
│ L1-dcache-loads 1.45e+10 25.0%
└ L1-icache-load-misses 1.59e+07 25.0%
┌ dTLB-load-misses 2.51e+05 25.0% # 0.0% of dTLB loads
└ dTLB-loads 1.45e+10 25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
It needs twice as many instructions for the small `b`, but performance is totally dominated by memory bandwidth, so it doesn't really matter. Regular `foo9`, meanwhile, requires 10-20x the instructions for some reason.
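A plausible intuition for the instruction-count gap (a hand-written sketch, not the actual generated code): base broadcasting must support a size-1 axis in any argument, so it does per-element index bookkeeping, whereas a FastBroadcast-style kernel can check shapes once up front and then run a plain linear-indexed loop, falling back to the slow path only when needed:

```julia
# Dynamic-style copy: per-element index clamping, analogous to what
# base broadcast must do to tolerate size-1 axes in `src`.
function dyn_copy!(dest, src)
    for I in CartesianIndices(dest)
        # Clamp each coordinate against src's (possibly size-1) axes.
        J = CartesianIndex(min.(Tuple(I), size(src)))
        dest[I] = src[J]
    end
    dest
end

# FastBroadcast-style copy: branch once on the shapes, then use a
# simple SIMD-friendly linear loop when no broadcasting is needed.
function fast_copy!(dest, src)
    if size(dest) == size(src)
        @inbounds @simd for k in eachindex(dest, src)
            dest[k] = src[k]
        end
    else
        dyn_copy!(dest, src)  # dynamic-broadcast fallback
    end
    dest
end
```

The branch in `fast_copy!` runs once per call rather than once per element, which is consistent with the counters above: the dynamic path costs roughly 2x the instructions, not 10-20x.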
Should we fix this inaccuracy by inserting a sleep call in the dynamic broadcasting branch?
> Should we fix this inaccuracy by inserting a sleep call in the dynamic broadcasting branch?
Probably better to update the README instead; it currently claims FastBroadcast is slower than base broadcasting for dynamic broadcasts.