JuliaStats / StatsBase.jl

Basic statistics for Julia

Countmap is 100x slower than it should be in some cases

LilithHafner opened this issue · comments

```julia
julia> @btime countmap(x) setup=(x=rand(Int16.(1:10), 30));
  41.375 μs (6 allocations: 512.47 KiB)

julia> @btime countmap(x) setup=(x=rand(1:10, 30));
  266.611 ns (5 allocations: 848 bytes)
```
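For context, here is a minimal sketch of the two strategies (hypothetical implementations, not StatsBase's actual code) showing where the 512 KiB comes from: the small-integer method allocates a count table covering the type's entire value range up front, which costs the same whether the input has 30 elements or 30 million.

```julia
# Dict-based counting: cost scales with the number of elements only.
function countmap_dict(x::AbstractVector{T}) where T
    counts = Dict{T,Int}()
    for v in x
        counts[v] = get(counts, v, 0) + 1
    end
    counts
end

# Table-based counting for small integer types: allocates one slot per
# possible value, i.e. a Vector{Int} of length 65536 (512 KiB) for Int16,
# regardless of how short the input is.
function countmap_table(x::AbstractVector{T}) where T<:Union{Int8,UInt8,Int16,UInt16}
    offset = 1 - Int(typemin(T))  # maps typemin(T) to index 1
    table = zeros(Int, Int(typemax(T)) - Int(typemin(T)) + 1)
    for v in x
        table[Int(v) + offset] += 1
    end
    Dict(T(i - offset) => c for (i, c) in pairs(table) if c > 0)
end
```

The per-element work in the table version is a single array increment, hence the better multiplicative constant; the full-range allocation and final scan are the additive cost that dominates for short inputs.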

See #339. There is a specialized method for small integer types which is faster when there are a lot of elements.

Yeah, the runtime of both methods is O(n). The alternate method has a better multiplicative constant but a much worse additive constant, which is good for many elements but not so good for few. The figure below is a rough approximation of when each method is better; there are some methodological issues, but I think it's accurate to within an order of magnitude in all cases.
[figure.png: log-log plot of number of elements vs. range of values, with curves for where the #339 method gives a 3x speedup, parity ("even"), and a 3x slowdown]

Code for figure:

```julia
using StatsBase, Plots

function linear_regression(X, Y, x)
    X0 = mean(X); Y0 = mean(Y)
    m = sum((X .- X0) .* (Y .- Y0)) / sum((X .- X0).^2)
    (x - X0)*m + Y0
end

t(T, a, b) = (x = rand(T.(1:a), b); minimum(@elapsed(countmap(x)) for _ in 1:100))
f(a, b) = t(Int16, a, b)/t(Int64, a, b)
function g(a, target=1)
    b = 10
    while f(a, b) > target
        b *= 10
    end
    input_sizes = round.(Int, exp.(range(log(b/10), log(b), length=10)))
    time_ratios = f.(a, input_sizes)
    exp(linear_regression(log.(time_ratios), log.(input_sizes), log(target)))
end

x = floor.(exp.(2:.2:log(typemax(Int16))))
@time y_even = g.(x)
@time y_slowdown = g.(x, 3)
x_speedup = vcat(x[5:5:end-3], last(x))
@time y_speedup = g.(x_speedup, 1/3)

plot(x_speedup, y_speedup, label="3x speedup")
plot!(x, y_even, xlabel="range of values", ylabel="number of elements",
    xaxis=:log10, yaxis=:log10, xticks=7, yticks=7, label="even",
    title="When is the method added in #339 good?")
plot!(x, y_slowdown, label="3x slowdown")
ylims!(10^2, 10^7)
savefig("figure.png")
```

Thanks for benchmarking. It could make sense to use that method only when there are more than e.g. 10_000 or 100_000 entries. Is the threshold different for Int16, Int8 and Bool?

length(x) > Base.max_values(eltype(x)) seems like it would be a reasonable threshold (though avoid using Base.max_values because it is internal)
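That threshold could be sketched like this (hypothetical helper names, not StatsBase API): use the table-based method only when the array has more elements than the element type has possible values, so the O(range) table allocation is amortized.

```julia
# Hypothetical dispatch heuristic (a sketch, not actual StatsBase code).
# table_size mirrors what Base.max_values computes, without relying on
# that internal function.
table_size(::Type{Bool}) = 2
table_size(::Type{T}) where {T<:Union{Int8,UInt8,Int16,UInt16}} =
    Int(typemax(T)) - Int(typemin(T)) + 1

# Prefer the table-based method only when the input is longer than the
# type's full value range.
use_table_method(x::AbstractVector{T}) where {T<:Union{Bool,Int8,UInt8,Int16,UInt16}} =
    length(x) > table_size(T)
```

Under this rule, an `Int8` array would switch to the table method past 256 elements, an `Int16` array past 65536.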

Why not. So you mean that for Int8 the specialized method is faster for arrays longer than 256 elements (roughly)? Would you make a pull request?

I'm not eager to add heuristics and additional code complexity without robust performance analysis, and unfortunately I don't have the bandwidth right now to conduct extensive benchmarking.

Given what you showed I would say that a rough heuristic is better than the current situation, but as you prefer.