scandum / fluxsort

A fast branchless stable quicksort / mergesort hybrid that is highly adaptive.

Suboptimal code-gen in the fundamental branchless-swap building block

Voultapher opened this issue · comments

The fundamental branchless swap_if code produces suboptimal code on x86-64. I ported it to Rust and noticed that changing it yielded a 50% performance uplift for that function on Zen3. This will of course depend on the hardware, but cmov seems to yield better results than the setl/setg style code that is currently being produced, probably helped by taking 8 instead of 10 instructions.

Here is the current version:
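Roughly, the setcc/index style looks like this (a minimal sketch with an illustrative name, not the library's exact swap_if macro):

```c
// Minimal sketch of the index-arithmetic style: the comparison result is
// turned into an array index (0 or 1), which compilers typically lower to
// setg/setl plus address computation.
static inline void swap_if_gt_indexed(int *arr)
{
    int x = arr[0] > arr[1];  /* 0 or 1, usually a setcc instruction */
    int y = !x;
    int tmp = arr[y];
    arr[0]  = arr[x];
    arr[1]  = tmp;
}
```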

And here is the version that produces cmov code:
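Again as an illustrative sketch rather than the exact code from the Rust port, the ternary form that LLVM tends to lower to two cmov instructions on x86-64 (csel on AArch64):

```c
// Minimal sketch of the ternary style: both stores are conditional selects,
// so the compiler can emit cmov instead of setcc plus indexed loads.
static inline void swap_if_gt_cmov(int *arr)
{
    int a  = arr[0];
    int b  = arr[1];
    int gt = a > b;
    arr[0] = gt ? b : a;
    arr[1] = gt ? a : b;
}
```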

I think if you can find a way to reliably produce cmov instructions like LLVM does, you should see a noticeable speed improvement.

I looked into that in the past, but it doesn't produce good results on my own system. I'm not quite sure whether the code is at fault or the compiler.

Is there any definite consensus on the right way to perform branchless swaps?

I'm not sure there is consensus, but I saw a very significant speedup with cmov vs setl/setge code on Zen3, Broadwell, and Skylake; on Firestorm (M1), LLVM was already producing csel code for both versions. How did you test it? I notice you mark the comparison function no_inline, and that has disastrous effects on performance. This is where the gap to languages with template instantiation / monomorphization is most acute. If I understand correctly, you pull everything into the header, akin to header-only libraries, but even then LTO should level the playing field, I guess.

When it comes to performance testing, I always uncomment this line in bench.c:

//#define cmp(a,b) (*(a) > *(b)) // uncomment for fast primitive comparisons

That allows a fair comparison against C++ sorts.
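For illustration (the names below are mine, not bench.c's exact setup): with a function-pointer comparator every comparison is an indirect call the compiler can't inline, while the macro makes it a direct expression, comparable to what a C++ template or Rust generic produces after monomorphization:

```c
// Illustrative only: a qsort-style comparator passed by pointer cannot be
// inlined into the sort loop without LTO, so every comparison pays a call.
static int cmp_int(const void *a, const void *b)
{
    return (*(const int *)a > *(const int *)b) -
           (*(const int *)a < *(const int *)b);
}

// With the macro uncommented, the comparison is expanded in place:
//   #define cmp(a, b) (*(a) > *(b))
// so the branchless swap compiles down to a handful of instructions.
```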

I took a closer look at this. As far as I can tell, overall branchless swap performance is worse for gcc and clang on my hardware.

Ideally, you get that cmov without too much hassle. The current branchless compilation situation is a royal mess.

In addition, clang performs horribly on most of my core algorithms, some code running 2x slower. Hopefully it's a simple fix.

@Voultapher

https://github.com/Voultapher/sort-research-rs/blob/main/writeup/glidesort_perf_analysis/text.md

Just saw your benchmark. I've recently released a fluxsort and quadsort update with compile-time optimizations for clang. Overall, quadsort should be the fastest sort for random data in smaller ranges when compiled with clang -O3.

I also added the quadsort_prim() and fluxsort_prim() functions so it's possible to benchmark 32/64 bit primitive integers and C strings with the same binary. The bench.c file contains an example for sorting C strings.

Pretty good overall performance for ipn_stable. Is the performance on rand % 2 purely from an exponential search in a galloping merge?
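By exponential search I mean the usual gallop: double the probe distance, then binary-search the last gap, so long runs of equal keys cost O(log n) comparisons. A generic sketch of that technique, not the actual quadsort/fluxsort code:

```c
#include <stddef.h>

// Returns the index of the first element in arr[0..n) that is greater than
// key, assuming arr is sorted ascending.
static size_t gallop_right(const int *arr, size_t n, int key)
{
    size_t hi = 1;

    // Exponential phase: double the probe distance while the probed
    // element is still <= key.
    while (hi < n && arr[hi] <= key)
        hi *= 2;

    size_t lo = hi / 2;
    if (hi > n)
        hi = n;

    // Binary search within [lo, hi).
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        if (arr[mid] <= key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```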

I ran a benchmark of my own using rhsort's benchmark compiled with clang -O3. This suggests most of the performance gain on the Rust side is from branchless ternary operations, though timsort does quite well on long variable runs.


data table
Name Items Type Best Average Loops Samples Distribution
quadsort 131072 32 0.002134 0.002152 0 100 random order
fluxsort 131072 32 0.002464 0.002502 0 100 random order
glidesort 131072 32 0.002999 0.003017 0 100 random order
quadsort 131072 32 0.001709 0.001733 0 100 random % 100
fluxsort 131072 32 0.000902 0.000908 0 100 random % 100
glidesort 131072 32 0.001011 0.001035 0 100 random % 100
quadsort 131072 32 0.000061 0.000062 0 100 ascending order
fluxsort 131072 32 0.000058 0.000059 0 100 ascending order
glidesort 131072 32 0.000091 0.000092 0 100 ascending order
quadsort 131072 32 0.000335 0.000349 0 100 ascending saw
fluxsort 131072 32 0.000334 0.000339 0 100 ascending saw
glidesort 131072 32 0.000346 0.000356 0 100 ascending saw
quadsort 131072 32 0.000231 0.000242 0 100 pipe organ
fluxsort 131072 32 0.000222 0.000229 0 100 pipe organ
glidesort 131072 32 0.000229 0.000239 0 100 pipe organ
quadsort 131072 32 0.000073 0.000081 0 100 descending order
fluxsort 131072 32 0.000073 0.000082 0 100 descending order
glidesort 131072 32 0.000105 0.000109 0 100 descending order
quadsort 131072 32 0.000366 0.000369 0 100 descending saw
fluxsort 131072 32 0.000348 0.000354 0 100 descending saw
glidesort 131072 32 0.000357 0.000361 0 100 descending saw
quadsort 131072 32 0.000687 0.000702 0 100 random tail
fluxsort 131072 32 0.000792 0.000819 0 100 random tail
glidesort 131072 32 0.000939 0.000970 0 100 random tail
quadsort 131072 32 0.001177 0.001200 0 100 random half
fluxsort 131072 32 0.001384 0.001401 0 100 random half
glidesort 131072 32 0.001625 0.001652 0 100 random half
quadsort 131072 32 0.001643 0.001686 0 100 ascending tiles
fluxsort 131072 32 0.000579 0.000590 0 100 ascending tiles
glidesort 131072 32 0.002516 0.002543 0 100 ascending tiles
quadsort 131072 32 0.002184 0.002199 0 100 bit reversal
fluxsort 131072 32 0.002223 0.002257 0 100 bit reversal
glidesort 131072 32 0.002735 0.002765 0 100 bit reversal
quadsort 131072 32 0.001456 0.001474 0 100 random % 2
fluxsort 131072 32 0.000359 0.000364 0 100 random % 2
glidesort 131072 32 0.000443 0.000464 0 100 random % 2
quadsort 131072 32 0.001332 0.001362 0 100 signal
fluxsort 131072 32 0.001587 0.001602 0 100 signal
glidesort 131072 32 0.003688 0.003711 0 100 signal
quadsort 131072 32 0.001923 0.001947 0 100 exponential
fluxsort 131072 32 0.001281 0.001291 0 100 exponential
glidesort 131072 32 0.002313 0.002335 0 100 exponential