performance of `clamp`
johnnychen94 opened this issue · comments
This is expected, but I think it can be optimized.
```julia
julia> @btime clamp!($(rand(N0f8, 512, 512)), 0.1, 0.9);
  813.628 μs (0 allocations: 0 bytes)

julia> @btime clamp!($(rand(N0f8, 512, 512)), 0.1N0f8, 0.9N0f8);
  628.583 μs (0 allocations: 0 bytes)

julia> @btime clamp.($(rand(N0f8, 512, 512)), 0.1, 0.9);
  691.843 μs (3 allocations: 2.00 MiB)

julia> @btime clamp.($(rand(N0f8, 512, 512)), 0.1N0f8, 0.9N0f8);
  1.146 ms (3 allocations: 256.19 KiB)
```
Ref: observed in JuliaImages/ImageContrastAdjustment.jl#28 (comment)
I don't understand what the benchmarks above are intended to show. :confused:
Is this what you mean?
```julia
julia> @btime clamp!($(zeros(N0f8, 512, 512)), 0.1, 0.9);
  505.900 μs (0 allocations: 0 bytes)

julia> @btime clamp!($(zeros(N0f8, 512, 512)), 0.1N0f8, 0.9N0f8);
  114.299 μs (0 allocations: 0 bytes)

julia> @btime clamp.($(zeros(N0f8, 512, 512)), 0.1, 0.9);
  687.300 μs (2 allocations: 2.00 MiB)

julia> @btime clamp.($(zeros(N0f8, 512, 512)), 0.1N0f8, 0.9N0f8);
  179.200 μs (2 allocations: 256.14 KiB)
```
Yes, I also think it can be optimized (specialized) as follows:
```julia
Base.clamp(x::X, lo::X, hi::X) where X<:FixedPoint = X(clamp(x.i, lo.i, hi.i), 0)
```
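Why a same-type specialization like this is safe can be sketched without FixedPointNumbers at all: the raw-integer-to-float mapping of an `N0f8`-like type is monotonic, so clamping the raw `UInt8` values agrees with clamping the represented floats. A minimal, purely illustrative model (the name `tofloat` and the raw bounds are mine, not package API):

```julia
# Hypothetical model of an N0f8-like type: a raw UInt8 `i` represents i/255.
tofloat(i::UInt8) = i / 255

# Because `tofloat` is monotonic, clamping the raw integers gives the same
# result as clamping the represented floating-point values.
lo, hi = 0x1a, 0xe6              # raw bounds, roughly 0.1 and 0.9
for i in 0x00:0xff
    @assert tofloat(clamp(i, lo, hi)) ==
            clamp(tofloat(i), tofloat(lo), tofloat(hi))
end
```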
However, the bottleneck should be identified first. I don't think floating-point conversions are "too" slow. They are inherently expensive.
I was thinking of a specialization like this:

```julia
Base.clamp(x::X, lo, hi) where X<:FixedPoint = clamp(x, X(lo), X(hi))
```
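One way to see why converting the bounds helps in the broadcast case: with `Float64` bounds every element is promoted before comparison, while bounds pre-converted to the element type keep the computation in the narrow type. A hedged sketch of hoisting the conversion out of the broadcast (the helper name `clamp_hoisted` is mine, not part of any package; note that, as discussed later in this thread, eagerly converting the bounds changes behavior for unrepresentable bounds like `Inf`):

```julia
# Hedged sketch: convert the bounds to eltype(A) once, outside the
# broadcast, instead of promoting every element of A to Float64.
clamp_hoisted(A::AbstractArray, lo, hi) =
    clamp.(A, convert(eltype(A), lo), convert(eltype(A), hi))

A = rand(Float32, 4, 4)
B = clamp_hoisted(A, 0.1, 0.9)   # bounds converted to Float32 once
</imports>
```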
I opened this as a potential issue to track and haven't spent much time playing with it yet; feel free to close it if you think it makes no sense.
`clamp!` can be optimized, but I don't want to do it because there are many "in-place" variants. `clamp` must not be changed as you suggested. `FixedPoint` is not a pixel type (e.g. `Gray{N0f8}`).
> feel free to close it if you think it makes no sense.

This issue clearly makes sense, but I don't know what you want.
TBH, I had believed that the optimization for `clamp(x::X, lo::X, hi::X) where X<:FixedPoint` was done by the compiler (LLVM backend). I think it is worth introducing. :rocket:
Oh, I wrote this down as a quick note to myself and didn't spend enough time playing with it; I'm sorry I didn't make it clear.
> This is expected, but I think it can be optimized.
I meant to say: "I believe this performance gap is expected, but I think there is merit in proactively converting `lo` and `hi` to the `FixedPoint` type, because the general philosophy is to not care about the storage type, since it's hard to intuit the types in practice; when we try to tweak performance by converting/promoting types, that's a code smell to me."
I'm in a bit of a hurry catching up with my own deadlines; I'll post an update later if I find anything worth optimizing.
> `clamp!` can be optimized, but I don't want to do it because there are many "in-place" variants.
I don't quite understand the reasoning behind this, can you clarify it a bit? Thanks. My understanding is that in-place functions in Base are worth extending.
> `clamp!` can be optimized, but I don't want to do it because there are many "in-place" variants.
>
> I don't quite understand the reasoning behind this, can you clarify it a bit?
The specialization of many "in-place" methods requires much labor and offers little benefit.
> I believe the general philosophy is to not care about storage type
I agree with you. Therefore, I think `clamp` should be used whenever we can use `clamp` instead of `clamp!`. When we use `clamp!`, we must "definitely" pay attention to the storage type! The performance issues above are the result of "carelessness". :confused:
Also, if we follow that philosophy, `clamp` MUST NOT convert the "return" type.
Certainly the specialization in #179 (comment) makes sense. The one in #179 (comment) is much more dangerous, since currently `clamp(x, -Inf, Inf)` succeeds for all `FixedPoint` values, but that definition would throw an error since `Inf` does not have a representation as a `FixedPoint`. We could, however, define `clamp_fast` and make `@fastmath` do semantic replacement (this would presumably require a PR to Base). It might take a little work to get it to hoist the conversion for broadcasting, though.
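The `-Inf`/`Inf` danger can be illustrated with a small pure-Julia model (no FixedPointNumbers dependency; `to_raw` is a hypothetical checked converter, not package API): a type covering only `[0, 1]` has no representation for infinite bounds, so eagerly converting the bounds throws where today's generic `clamp` succeeds.

```julia
# Hypothetical checked converter for an N0f8-like type covering [0, 1].
function to_raw(x::Real)
    0 <= x <= 1 || throw(ArgumentError("$x cannot be represented"))
    round(UInt8, 255x)
end

to_raw(0.9)        # fine: 0.9 is representable
# to_raw(Inf)      # would throw ArgumentError, so an eagerly converting
#                  # clamp(x, -Inf, Inf) would fail instead of succeeding
```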
Cases like `@fastmath clamp.(A * B, lo, hi)` would be trouble, though, if `A` and `B` don't have the same element type. This could get complex pretty quickly. The most robust approach is to pass `lo` and `hi` of the correct type manually.
I merged PR #194. If we want to improve it further, I will keep this issue open; otherwise I will close it.
Thanks for working on this. #194 looks like a good-enough improvement to me.