scandum / fluxsort

A fast branchless stable quicksort / mergesort hybrid that is highly adaptive.

Suboptimal code-gen in the fundamental branchless-swap building block

Voultapher opened this issue · comments

The fundamental branchless swap_if code produces suboptimal code on x86-64. I ported it to Rust and noticed that changing it yielded a 50% performance uplift for that function on Zen3. This will of course depend on the hardware, but cmov seems to yield better results than the setl/setg style code that is currently being produced, probably helped by taking 8 instead of 10 instructions.

Here is the current version:
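Roughly, the setcc/index style looks like this (a minimal sketch with an illustrative name, not the library's exact swap_if macro):

```c
// Minimal sketch of the index-arithmetic style: the comparison result is
// turned into an array index (0 or 1), which compilers typically lower to
// setg/setl plus address computation.
static inline void swap_if_gt_indexed(int *arr)
{
    int x = arr[0] > arr[1];  /* 0 or 1, usually a setcc instruction */
    int y = !x;
    int tmp = arr[y];
    arr[0]  = arr[x];
    arr[1]  = tmp;
}
```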

And here is the version that produces cmov code:
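Again as an illustrative sketch rather than the exact code from the Rust port, the ternary form that LLVM tends to lower to two cmov instructions on x86-64 (csel on AArch64):

```c
// Minimal sketch of the ternary style: both stores are conditional selects,
// so the compiler can emit cmov instead of setcc plus indexed loads.
static inline void swap_if_gt_cmov(int *arr)
{
    int a  = arr[0];
    int b  = arr[1];
    int gt = a > b;
    arr[0] = gt ? b : a;
    arr[1] = gt ? a : b;
}
```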

I think if you can find a way to reliably produce cmov instructions like LLVM does, you should see a noticeable speed improvement.

I looked into that in the past, but it doesn't produce good results on my own system. I'm not quite sure whether the code is at fault or the compiler.

Is there any definite consensus on the right way to perform branchless swaps?

I'm not sure there is consensus, but I saw a very significant speedup with cmov vs setl/setge code on Zen3, Broadwell, and Skylake; on Firestorm (M1), LLVM was already producing csel code for both versions. How did you test it? I notice you mark the comparison function no_inline, and that has disastrous effects on performance. This is where the gap to languages with template instantiation / monomorphization is most acute. If I understand correctly, you pull everything into the header, akin to header-only libraries, but even then LTO should level the playing field, I guess.

When it comes to performance testing, I always uncomment this line in bench.c:

//#define cmp(a,b) (*(a) > *(b)) // uncomment for fast primitive comparisons

That allows a fair comparison against C++ sorts.
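For illustration (the names below are mine, not bench.c's exact setup): with a function-pointer comparator every comparison is an indirect call the compiler can't inline, while the macro makes it a direct expression, comparable to what a C++ template or Rust generic produces after monomorphization:

```c
// Illustrative only: a qsort-style comparator passed by pointer cannot be
// inlined into the sort loop without LTO, so every comparison pays a call.
static int cmp_int(const void *a, const void *b)
{
    return (*(const int *)a > *(const int *)b) -
           (*(const int *)a < *(const int *)b);
}

// With the macro uncommented, the comparison is expanded in place:
//   #define cmp(a, b) (*(a) > *(b))
// so the branchless swap compiles down to a handful of instructions.
```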

I took a closer look at this. As far as I can tell, overall branchless swap performance is worse for gcc and clang on my hardware.

Ideally, you get that cmov without too much hassle. The current branchless compilation situation is a royal mess.

In addition, clang performs horribly on most of my core algorithms, some code running 2x slower. Hopefully it's a simple fix.

@Voultapher

https://github.com/Voultapher/sort-research-rs/blob/main/writeup/glidesort_perf_analysis/text.md

Just saw your benchmark. I've recently released a fluxsort and quadsort update with compile-time optimizations for clang. Overall, quadsort should be the fastest sort for random data in smaller ranges when compiled with clang -O3.

I also added the quadsort_prim() and fluxsort_prim() functions so it's possible to benchmark 32/64 bit primitive integers and C strings with the same binary. The bench.c file contains an example for sorting C strings.

Pretty good overall performance for ipn_stable. Is the performance on rand % 2 purely from an exponential search in a galloping merge?
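By exponential search I mean the usual gallop: double the probe distance, then binary-search the last gap, so long runs of equal keys cost O(log n) comparisons. A generic sketch of that technique, not the actual quadsort/fluxsort code:

```c
#include <stddef.h>

// Returns the index of the first element in arr[0..n) that is greater than
// key, assuming arr is sorted ascending.
static size_t gallop_right(const int *arr, size_t n, int key)
{
    size_t hi = 1;

    // Exponential phase: double the probe distance while the probed
    // element is still <= key.
    while (hi < n && arr[hi] <= key)
        hi *= 2;

    size_t lo = hi / 2;
    if (hi > n)
        hi = n;

    // Binary search within [lo, hi).
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        if (arr[mid] <= key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```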

I ran a benchmark of my own using rhsort's benchmark compiled with clang -O3. This suggests most of the performance gain on the Rust side is from branchless ternary operations, though timsort does quite well on long variable runs.


data table
Name Items Type Best Average Loops Samples Distribution
quadsort 131072 32 0.002134 0.002152 0 100 random order
fluxsort 131072 32 0.002464 0.002502 0 100 random order
glidesort 131072 32 0.002999 0.003017 0 100 random order
quadsort 131072 32 0.001709 0.001733 0 100 random % 100
fluxsort 131072 32 0.000902 0.000908 0 100 random % 100
glidesort 131072 32 0.001011 0.001035 0 100 random % 100
quadsort 131072 32 0.000061 0.000062 0 100 ascending order
fluxsort 131072 32 0.000058 0.000059 0 100 ascending order
glidesort 131072 32 0.000091 0.000092 0 100 ascending order
quadsort 131072 32 0.000335 0.000349 0 100 ascending saw
fluxsort 131072 32 0.000334 0.000339 0 100 ascending saw
glidesort 131072 32 0.000346 0.000356 0 100 ascending saw
quadsort 131072 32 0.000231 0.000242 0 100 pipe organ
fluxsort 131072 32 0.000222 0.000229 0 100 pipe organ
glidesort 131072 32 0.000229 0.000239 0 100 pipe organ
quadsort 131072 32 0.000073 0.000081 0 100 descending order
fluxsort 131072 32 0.000073 0.000082 0 100 descending order
glidesort 131072 32 0.000105 0.000109 0 100 descending order
quadsort 131072 32 0.000366 0.000369 0 100 descending saw
fluxsort 131072 32 0.000348 0.000354 0 100 descending saw
glidesort 131072 32 0.000357 0.000361 0 100 descending saw
quadsort 131072 32 0.000687 0.000702 0 100 random tail
fluxsort 131072 32 0.000792 0.000819 0 100 random tail
glidesort 131072 32 0.000939 0.000970 0 100 random tail
quadsort 131072 32 0.001177 0.001200 0 100 random half
fluxsort 131072 32 0.001384 0.001401 0 100 random half
glidesort 131072 32 0.001625 0.001652 0 100 random half
quadsort 131072 32 0.001643 0.001686 0 100 ascending tiles
fluxsort 131072 32 0.000579 0.000590 0 100 ascending tiles
glidesort 131072 32 0.002516 0.002543 0 100 ascending tiles
quadsort 131072 32 0.002184 0.002199 0 100 bit reversal
fluxsort 131072 32 0.002223 0.002257 0 100 bit reversal
glidesort 131072 32 0.002735 0.002765 0 100 bit reversal
quadsort 131072 32 0.001456 0.001474 0 100 random % 2
fluxsort 131072 32 0.000359 0.000364 0 100 random % 2
glidesort 131072 32 0.000443 0.000464 0 100 random % 2
quadsort 131072 32 0.001332 0.001362 0 100 signal
fluxsort 131072 32 0.001587 0.001602 0 100 signal
glidesort 131072 32 0.003688 0.003711 0 100 signal
quadsort 131072 32 0.001923 0.001947 0 100 exponential
fluxsort 131072 32 0.001281 0.001291 0 100 exponential
glidesort 131072 32 0.002313 0.002335 0 100 exponential