scandum / fluxsort

A fast branchless stable quicksort / mergesort hybrid that is highly adaptive.


Suboptimal code-gen in the fundamental branchless-swap building block

Voultapher opened this issue · comments

The fundamental branchless swap_if code produces suboptimal code on x86-64. I ported it to Rust and noticed that changing it yielded a 50% performance uplift for that function on Zen3. This will of course depend on the hardware, but cmov seems to yield better results than the setl/setg style code that is currently being produced, probably helped by needing 8 instead of 10 instructions.

Here is the current version:

And here is the version that produces cmov code:

I think if you can find a way to reliably produce cmov instructions like LLVM does, you should see a noticeable speed improvement.
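For readers following along, here is a minimal C sketch of the two styles being contrasted. These are not the snippets originally posted in the issue; the function names `swap_if_index` and `swap_if_select` are hypothetical. The first turns the comparison result into a 0/1 value and uses it as an array index, which tends to lower to setcc-style code. The second selects between two values already held in registers, which compilers often (but not always) lower to cmov on x86-64 or csel on AArch64. What actually gets emitted depends on the compiler, version and flags, so inspect the assembly before drawing conclusions.

```c
#include <assert.h>

/* Index-based variant: the comparison result (0 or 1) is used
 * as an array index, so no branch is needed. */
static void swap_if_index(int *pair)
{
    int x = pair[0] > pair[1];   /* 1 if the pair is out of order */
    int y = !x;
    int tmp = pair[y];           /* the larger element */

    pair[0] = pair[x];           /* the smaller element */
    pair[1] = tmp;
}

/* Select-based variant: both values are loaded first, then each
 * slot picks one of them. Assumption: the compiler may turn the
 * ternaries into conditional moves, but this is not guaranteed. */
static void swap_if_select(int *pair)
{
    int a = pair[0];
    int b = pair[1];
    int swap = a > b;

    pair[0] = swap ? b : a;
    pair[1] = swap ? a : b;
}

/* Sanity check that both variants agree on a small input. */
static void check_swap_variants(void)
{
    int p[2] = {3, 1};
    int q[2] = {3, 1};

    swap_if_index(p);
    swap_if_select(q);
    assert(p[0] == q[0] && p[1] == q[1]);
}
```

Both functions leave an already ordered pair untouched and keep equal elements in place, so either can serve as the two-element building block of a stable sort.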

I looked into that in the past, but it doesn't produce good results on my own system. I'm not quite sure whether the code is at fault or the compiler.

Is there any definite consensus on the right way to perform branchless swaps?

I'm not sure there is consensus, but I saw a very significant speedup with cmov vs setl/ge code on Zen3, Broadwell and Skylake. On Firestorm (M1), LLVM was already producing csel code for both versions. How did you test it? I ask because I notice you no_inline the comparison function, and that has disastrous effects on performance. This is where the difference to languages with template instantiation / monomorphization is most acute. If I understand correctly, you pull everything into the header, akin to header-only libraries. But even then, LTO should level the playing field, I guess.

When it comes to performance testing, I always uncomment this line in bench.c:

//#define cmp(a,b) (*(a) > *(b)) // uncomment for fast primitive comparisons

That allows a fair comparison against C++ sorts.
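To illustrate why that macro matters, here is a small sketch contrasting a comparison made through a function pointer with one expanded in place by a macro. The names `cmp_fn`, `cmp_int`, `CMP_INLINE`, `sort2_callback` and `sort2_inline` are hypothetical and not taken from fluxsort or bench.c; only the macro body mirrors the line quoted above. With the callback, the compiler generally has to emit an indirect call unless it can inline through the pointer. With the macro, the comparison is visible at the use site, which is roughly what a C++ template instantiation gets for free.

```c
#include <assert.h>

/* Callback-style comparison, as in qsort-like interfaces. */
typedef int cmp_fn(const void *a, const void *b);

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a > *(const int *)b;
}

/* Macro-style comparison: expands to a plain integer compare. */
#define CMP_INLINE(a, b) (*(a) > *(b))

/* Orders a pair using the callback; the call may stay opaque. */
static void sort2_callback(int *pair, cmp_fn *cmp)
{
    int x = cmp(pair, pair + 1) > 0;
    int tmp = pair[!x];

    pair[0] = pair[x];
    pair[1] = tmp;
}

/* Same logic, but the comparison is expanded in place. */
static void sort2_inline(int *pair)
{
    int x = CMP_INLINE(pair, pair + 1);
    int tmp = pair[!x];

    pair[0] = pair[x];
    pair[1] = tmp;
}

/* Sanity check that both paths produce the same ordering. */
static void check_sort2(void)
{
    int p[2] = {5, 2};
    int q[2] = {5, 2};

    sort2_callback(p, cmp_int);
    sort2_inline(q);
    assert(p[0] == q[0] && p[1] == q[1]);
}
```

The two functions behave identically; any difference shows up only in the generated code and in timing, which is why a benchmark should say which of the two modes it measured.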