scandum / fluxsort

A fast branchless stable quicksort / mergesort hybrid that is highly adaptive.

Collaborating?

mlochbaum opened this issue

First off, sorry again to have jumped to conclusions on Hacker News. As a programming language guy I'm in the market for a stable sort and Fluxsort seems to have the fundamentals down!

I wanted to know if you're interested in incorporating some or all of what I do into this repository. I'd like to help improve Fluxsort's worst cases and pattern recognition with techniques like pdqsort uses, and expand the interface to support other operations. Some possible topics, which I can also split into individual issues:

  • Add more sophisticated benchmarks with subtle patterns
  • Candidate randomization to avoid poor pivot selection for patterned as well as random data
  • Move the call stack into the memory buffer to avoid stack overflow and improve speed
  • Recognize signs of mostly-sorted inputs during pivot selection
  • Specialized integer code; maybe SIMD instructions, non-stable optimizations, or hybridization with counting sort

(pdqsort depends on instability a lot, but I think there are usually ways to avoid this)

As evidence that I have some clue what I'm talking about, here's a small improvement: the following inner loop for flux_partition gives about a 3% speedup in my benchmarks on random 1e6-element 4-byte int arrays.

size_t n = 0, val;

// write each element to both destinations, then advance only the side that keeps it
val = cmp(ptx + 0, &piv) <= 0; pta[n] = ptx[0]; pts[0 - n] = ptx[0]; n += val;
// etc. for ptx[1] through ptx[7]

pta += n; pts += 8 - n;
ptx += 8;

  1. More sophisticated benchmarks are probably needed.
  2. Adding random pivots might prevent a beneficial switch to quadsort.
  3. I haven't previously been able to improve performance with a custom call stack, so I'm reluctant to clutter the code if there's no performance gain.
  4. Detecting ordered distributions during median selection could be useful.
  5. I updated https://github.com/scandum/wolfsort yesterday to default to fluxsort; pretty good gain on 32-bit integers.

Some elegant way to sort random generic data (like 16 distinct values) would be useful; pdqsort does this rather well, though I'm not sure if there's an added cost for other distributions.

That's a clever improvement, I'll do some testing and see about incorporating it.

I was able to adapt pdqsort's method for handling many equal values, which brings the performance on the "generic" case roughly in line with pdqsort. Pushed to https://github.com/mlochbaum/fluxsort . There doesn't seem to be any significant performance impact in other tested cases, but I found that r_array[cnt] = (unsigned)(cnt*cnt) % 10000 gets much slower, so I need to figure out why that is.

The strategy is the same but reversed left to right to account for the different recursion order. So, before partitioning it checks to see if the pivot is equal to the element after the current slice. The only extra step is to copy that element-after to the swap space before recursing into it. It might seem like the fact that pdqsort can put the pivot in the middle right after partitioning and fluxsort can't would make a difference, but it doesn't because the right partition is sorted (so it starts with the pivot) before visiting the left.
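
In sketch form (not the actual flux_partition code; the names here are mine), the check before partitioning looks something like this:

// `after` points to the element just past the current slice, which is already
// in its final sorted position, or is NULL when the slice runs to the end of
// the array.  Since *after is >= everything in the slice, piv == *after means
// that every element not strictly below piv is equal to it.
if (after != NULL && cmp(&piv, after) >= 0)
{
        // reverse partition (hypothetical helper): group the elements equal
        // to piv at the end of the slice, where they need no further sorting,
        // and recurse only into the "< piv" part, whose size is returned
        nmemb = FUNC(flux_reverse_partition)(array, swap, &piv, nmemb, cmp);
}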

How much do you care about code style here? I've copy-pasted the partitioning code to match the existing style, but I'd find a version that uses macros for the loop unrolling and two copies easier to read.

As far as I can tell the flux_max function isn't stable?

Moving the maximum element to the end might be possible to accomplish in n comparisons and ~ log n swaps on randomized data. I'm having a hard time wrapping my head around it.

As for code style, I can be a bit particular, but I'm open to suggestions.

On my end, I've experimented with a pseudomedian of 25 that detects sorted partitions. It improves sorting of generic data slightly, but it's barely worth the trouble.

On second thought, I think the bubble sort would be the right way to go. n comps and n swaps.

No, I guess it isn't. I'd been thinking about ensuring that the maximum element doesn't get swapped with anything less than it, but the element at the end winds up out of place. It's not really necessary, because you can track whether the current slice comes at the end of the array or not; it's just that you have to add cases for when it's used. But this condition starts out false and becomes true after the first base case, so it's not that complicated.

Finding the maximum is really fast with SIMD instructions (on appropriate types of course). Anything more complicated is probably too much overhead.

I reckon there will be a variety of solutions. I'll try to wrap my head around this partitioning logic in the next few days and get back to you.

Remark on pseudomedians: the order of the candidates does matter, and I think both pdqsort and flux are too systematic. I haven't analyzed the 15-candidate version, but the issue with 3-by-3 is that when combining values, two that are on the same side always outweigh a third. If values in the middle are higher and those at the edges are lower, then Fluxsort ends up with two low groups and one high group. It will then pick the median from one of the low groups even though the true median is probably the highest value from a low group. pdqsort spreads out the groups but gets the same problem in the first round of medians, so that every group will end up with a low median. A scheme like 125/048/367 should be better. I also wonder if it might be preferable to just take the true median of 5 or 7 (I don't know how hard that is) instead. Fluxsort has a lot of overhead coming from median calculations, and it seems to me like it should be possible to reduce it.
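
To make 125/048/367 concrete, here's a sketch, assuming a median_of_three that takes sample indices and returns the index of its median:

size_t v0, v1, v2, div = nmemb / 9;

// groups 1-2-5, 0-4-8 and 3-6-7: no group is three adjacent samples, and
// each group contains one of the middle candidates 3, 4, 5
v0 = FUNC(median_of_three)(array, div * 1, div * 2, div * 5, cmp);
v1 = FUNC(median_of_three)(array, div * 0, div * 4, div * 8, cmp);
v2 = FUNC(median_of_three)(array, div * 3, div * 6, div * 7, cmp);

piv = array[FUNC(median_of_three)(array, v0, v1, v2, cmp)];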

3x3 is as good at avoiding the worst case partition as 1x7 with fewer comparisons.

1x7 requires 6+5+4+3+2+1 = 21 comparisons worst case while 3x3 is at 3+3+3+3 = 12.

I guess a good baseline is to try to beat std::stable_sort for any distribution. This might be possible by adding run detection to the analyzer, since there's a tipping point where branch prediction outperforms branchless.

I was just wondering if it makes sense to search for an initial run at the beginning of flux_partition every time. Best case is a complete run of course. If it's a very large run you probably just switch to Quadsort right away. If there's nothing then you quit after 3-5 elements probably, and if there's a small run then at least you can partition it quickly with a binary search and the work isn't wasted. My worry is that I don't know how common perfect runs are relative to imperfect ones, especially after partitioning (but at least flux isn't destroying runs like unstable methods do). Timsort prioritizes perfect runs heavily, but I don't think it would be obvious if this was a bad idea all along.

SIMD is also relevant, since a run search can easily check multiple elements at a time. But it doesn't work in plain C because the compiler's not allowed to access elements after the stopping point.
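
As a rough sketch of the search itself (not fluxsort code, and the cutoffs are arbitrary):

// look for an ascending run at the front of the slice before partitioning
size_t run = 1;

while (run < nmemb && cmp(array + run - 1, array + run) <= 0)
{
        run++;
}

if (run == nmemb)
{
        return;                 // the whole slice is already sorted
}
else if (run > nmemb / 2)
{
        // very large run: hand the slice to quadsort, which merges runs cheaply
}
else if (run >= 4)
{
        // small run: binary search the pivot's position in the run so the
        // comparisons spent detecting it aren't wasted
}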

It's possible to perform a branchless run analysis at a ~3% performance cost for random, useful for detecting pipe organ distributions.

I'm getting good performance with a branchless bubble sort. I typically use ./a.out 10000 100 1 1 for accurate readings, though I had to drop down to ./a.out 1000 100 10 1 with this code. Ought to be stable.

// one branchless bubble pass: moves the maximum element to the end while
// keeping equal elements in their original order
void FUNC(flux_max)(VAR *array, size_t nmemb, CMPFUNC *cmp)
{
        VAR swap, *ptb, *pta = array;

        while (--nmemb)
        {
                // ptb is pta + 1 when the pair is out of order, otherwise pta itself
                ptb = &pta[cmp(pta, pta + 1) > 0];

                swap = *pta; *pta++ = *ptb; *ptb = swap;
        }
}

Edit: Previous code was going out of bounds, probably why the readings were odd. Not ideal performance, ~3% slower.

I took a closer look at the actual partitioning logic, and I think the analyzer would take care of most cases where flux_max would be beneficial? Another concern is that it takes an extra n comparisons, which isn't a big deal, but it might be to some people.

As far as I can tell most of the performance gain can be obtained from the if (s_size == 0) check. I'm leaning towards a slightly different approach, I'll attach what I currently have.

One of the things I'm trying to do is to minimize stack allocation during recursion. In my current setup that does result in some unnecessary pivot calls, but it seems to slightly improve performance on random without a notable impact on generic.

fluxsort.txt

This seems to work to deal with the cnt * cnt % 10000 distribution for large arrays. Perhaps a bit heavy-handed. It's still not quite on par with pdqsort for 10M+ but it gets pretty close.

Should be possible to keep track of the previous pivot with a maximum_of_twentyfive routine and pass it along recursively; I'm looking into that next.

void FUNC(fluxsort)(void *array, size_t nmemb, CMPFUNC *cmp);

// sort 128 evenly spaced samples with fluxsort itself and return the middle one
VAR FUNC(median_of_hundredtwentyeight)(VAR *array, size_t nmemb, CMPFUNC *cmp)
{
        VAR *pta, swap[128];
        size_t cnt, div = nmemb / 128;

        pta = array;

        for (cnt = 0 ; cnt < 128 ; cnt++)
        {
                swap[cnt] = pta[0];

                pta += div;
        }
        FUNC(fluxsort)(swap, 128, cmp);

        return swap[64];
}

// pseudomedian of 25: the median of five medians of five
VAR FUNC(median_of_twentyfive)(VAR *array, size_t nmemb, CMPFUNC *cmp)
{
        size_t v0, v1, v2, v3, v4, div = nmemb / 64;

        v0 = FUNC(median_of_five)(array, div *  4, div *  1, div *  2, div *  8, div * 10, cmp);
        v1 = FUNC(median_of_five)(array, div * 16, div * 12, div * 14, div * 18, div * 20, cmp);
        v2 = FUNC(median_of_five)(array, div * 32, div * 24, div * 30, div * 34, div * 38, cmp);
        v3 = FUNC(median_of_five)(array, div * 48, div * 42, div * 44, div * 50, div * 52, cmp);
        v4 = FUNC(median_of_five)(array, div * 60, div * 54, div * 56, div * 62, div * 63, cmp);

        return array[FUNC(median_of_five)(array, v2, v0, v1, v3, v4, cmp)];
}

                // pivot selection by partition size
                if (nmemb > 1024)
                {
                        if (nmemb > 65536)
                        {
                                piv = FUNC(median_of_hundredtwentyeight)(ptx, nmemb, cmp);
                        }
                        else
                        {
                                piv = FUNC(median_of_twentyfive)(ptx, nmemb, cmp);
                        }
                }
                else
                {
                        piv = FUNC(median_of_nine)(ptx, nmemb, cmp);
                }

I think that strategy is entirely reasonable, actually, and it doesn't even add much to the code size. It should be able to do the median in already-allocated memory (in extreme cases swap won't have enough space but array will). And the number of candidates can vary. I'll see if I can do some statistics and figure out how the number of candidates sorted relates to the ratio of sizes.

I still think switching to quadsort on a single bad partition isn't really viable. It should switch when the partition size is too large relative to the depth, for a cumulative measure. Statistics also needed here.
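
Something along these lines, with made-up names and an arbitrary factor of 8:

// an ideal split leaves about total >> depth elements after `depth` levels,
// so only a partition far larger than that counts as bad enough to give up
// on quicksort for this slice; a single unlucky pivot won't trigger it
int FUNC(flux_depth_exceeded)(size_t nmemb, size_t total, size_t depth)
{
        while (depth--)
        {
                total /= 2;
        }
        return nmemb / 8 > total;
}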

From this answer, medians follow a beta distribution: the probability that a median of m is found p of the way along the actual distribution is proportional to p^(m/2) * (1-p)^(m/2). And assuming the cost to sort is n log(n), the cost after partitioning is n log(n) plus n * (p log(p) + (1-p) log(1-p)). So given m, it's easy to numerically compute the coefficient k so that the expected cost is n*k. Then the value n where that cost is effective is m log(m) / k. But that relies on the cost of sorting having a constant coefficient; it doesn't, so this is an underestimate.
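
Written out, with p the quantile where the median of m candidates lands (just restating the two relations above):

\[
f_m(p) \propto p^{m/2} (1-p)^{m/2}, \qquad
C(n, p) \approx n \log n + n \big( p \log p + (1-p) \log(1-p) \big).
\]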

I computed this table of values, where the first column is n and the second is m. It tracks √n pretty closely, and the third column shows sqrt(n/(1+log_4(n))), which is a very close estimate. But the value m is an overestimate, and using a median half as large (corresponding to small-array sorting being four times as expensive) probably makes more sense. This suggests it's best to switch from median-of-15 at about 8000 elements. At this point the cost of computing the median size with count-leading-zeros and square root hardly matters.

If you're saving the sorted candidates between iterations then the math works out differently. You could either select more candidates at the start, or merge in candidates to get to the right amount at a lower cost (increasing the size by a factor of √2 if √n is still accurate).

That lines up pretty closely with my own pivot choices based on benchmarks. I might be able to write a dynamic pivot selection routine.

I ran some calculations and it seems like the cubic root of the partition size might be a good value for the pseudomedian.

I'm pretty happy with some of the changes I made so I'm thinking of pushing out a new version. How would you like me to credit you for your contributions? I've incorporated the loop unrolling and s_size == 0 optimization.

My branch uses the sqrt(n/(1+log_4(n))) / 2 value (but it computes the gap size, which works out to sqrt(2*n*(2+log_2(n)))). Decreasing the number of pivots definitely increases the time on shuffled data, so I think this is better than a cube root version.
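
For reference, the whole computation is tiny (a sketch, assuming gcc/clang's __builtin_clzll for the integer log):

#include <math.h>

// gap between evenly spaced pivot candidates: with about
// sqrt(n / (1 + log4(n))) / 2 candidates the gap works out to
// sqrt(2 * n * (2 + log2(n)))
static size_t candidate_gap(size_t nmemb)
{
        size_t log2n = 63 - __builtin_clzll(nmemb);     // integer log2 via count-leading-zeros

        return (size_t) sqrt(2.0 * (double) nmemb * (double) (2 + log2n));
}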

I also added a pseudorandom value to each pivot candidate. Running a full PRNG step for each pivot adds too much overhead, so I split Marsaglia's Xorshift into three steps and do only one for each candidate. The adjacent candidates won't be independent, but the overall selection will end up with plenty of randomness. I'm thinking about how to apply something similar to the smaller pivot computations as well. As far as I can tell the only reason not to randomize candidates is performance: the goal is to avoid patterns in pivot selection, and PRNGs are designed specifically for this purpose.
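
Roughly like this (a sketch with an arbitrary seed and made-up names; the shifts are the usual xorshift32 triple):

// one shift-xor per candidate instead of a full PRNG step; a complete
// xorshift round is spread across three consecutive candidates, so adjacent
// offsets are correlated but the selection as a whole still gets scrambled
static unsigned int flux_rand = 0x3b9a73c5;

static size_t candidate_offset(size_t div, unsigned int step)
{
        switch (step % 3)
        {
                case 0: flux_rand ^= flux_rand << 13; break;
                case 1: flux_rand ^= flux_rand >> 17; break;
                case 2: flux_rand ^= flux_rand <<  5; break;
        }
        return flux_rand % div;
}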

I'm not really concerned about the attribution (there's a record of what happened here, anyway). If you've copied any actual code instead of just ideas then you could add me to the license text.

I've also been looking at insertion sort. Replacing quadsort with insertion sort directly is slightly slower, but removing all the quadsort calls and running unguarded insertion sort on the entire array at the end is the same speed for random data or even slightly faster, and 2-3% faster after increasing FLUX_OUT to 32. It still needs a fallback in the case of bad pivot selection. If that exists, then at the end all partitions are either sorted or smaller than FLUX_OUT, and it's enough to run a guarded insertion sort on the first FLUX_OUT entries and an unguarded insertion sort on the rest. For testing I didn't do that—I didn't guard anything and it happened to work. I'll see if I can get a correct version.
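
Roughly what I mean, as a sketch (the function name is made up; FLUX_OUT is the existing small-partition threshold):

// guarded insertion sort over the first FLUX_OUT elements, unguarded over the
// rest: once all partitions are either sorted or smaller than FLUX_OUT, every
// later element has an element no larger than itself within FLUX_OUT positions
// to its left, so the inner loop stops without checking the array bound
void FUNC(flux_insert_finish)(VAR *array, size_t nmemb, CMPFUNC *cmp)
{
        VAR key, *pta, *ptb, *end = array + nmemb;
        VAR *guard = nmemb > FLUX_OUT ? array + FLUX_OUT : end;

        for (pta = array + 1 ; pta < guard ; pta++)
        {
                key = *pta;

                for (ptb = pta ; ptb > array && cmp(ptb - 1, &key) > 0 ; ptb--)
                {
                        *ptb = ptb[-1];
                }
                *ptb = key;
        }

        for ( ; pta < end ; pta++)
        {
                key = *pta;

                for (ptb = pta ; cmp(ptb - 1, &key) > 0 ; ptb--)
                {
                        *ptb = ptb[-1];
                }
                *ptb = key;
        }
}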

Increasing FLUX_OUT to 32 increases the comparison count by quite a bit, I'm trying to keep performance balanced for both integers and strings.

I updated the source code with the discussed changes. I think I'll create a separate repository that purely focuses on benchmarking pivot selection.

I'm also very interested in pivot selection. I'd never have guessed that stopping at 9 (or 15 even!) candidates is measurably worse. I just added randomization to the 15-candidate version, based on this generator because I only use a few low bits and the smaller numbers scramble these faster. This randomization makes median_of_fifteen itself about 50% slower, but that comes out to around 0.1% because it's only used on large partitions.

Too much overhead to do this for 9 candidates. Maybe using a fixed random sequence is best here. The main thing is to avoid doing too badly on exactly periodic data, because that kind of thing will definitely show up. It's also worth looking at a 3-candidate version for the smallest partitions (64? 128?), since it means there's less pressure on the 9-candidate one to be as fast as possible.

The 1 in 65,536 chance you give for pseudomedian-of-9 failing is too optimistic. True, it requires 4 candidates to be in the bad range, but there are many combinations of candidates that will cause the failure so the probability is higher. The actual probability that the larger partition fails is 2 * f(f(1/17)), where f(x) = x^3 + 3*(1-x)*x^2 is the probability a median of 3 fails if each of the three has failure rate x. I multiplied by 2 since it can fail on either side, and the end result is always that most of the array is sorted by fallback. That's about 1 in 1687 (confirmed with a simulation), which is not that low considering it happens at every partition.

Median of 5 groups of 3 gives 1 in 51161, which is probably fine, and slightly better than 3 of 5.

I'm a bit hesitant about randomization since it might make it difficult to pinpoint bad patterns.

I'll run some tests to see for myself if the failure rate is 1 in 1687 for the pseudomedian of 9; I've wondered if I was too optimistic thinking it would be 0.0625 ^ 4.

I've switched to the pseudomedian of 25 for partitions above 1024 which seems to do fairly well.

I'm doing some testing with larger pseudomedians and what I've dubbed quasimedians (3x3x3, 5x5x5, etc.) as well.

Name         Items  Type  Best      Average   Loops  Samples  Median  Comps    Distribution
1 median     10000  32    0.000079  0.000104  10000  10       49.9    0.0      random
3 median     10000  32    0.000317  0.000374  10000  10       37.5    2.7      random
9 pseudo     10000  32    0.001491  0.001523  10000  10       26.5    10.7     random
11 median    10000  32    0.002646  0.002720  10000  10       22.6    34.0     random
15 pseudo    10000  32    0.002788  0.002869  10000  10       21.8    21.3     random
27 quasi     10000  32    0.004295  0.004416  10000  10       18.9    34.6     random
25 pseudo    10000  32    0.005524  0.005639  10000  10       17.6    48.1     random
49 pseudo    10000  32    0.009105  0.009657  10000  10       15.6    89.1     random
81 pseudo    10000  32    0.014094  0.014208  10000  10       12.0    180.4    random
125 quasi    10000  32    0.021470  0.021789  10000  10       9.6     248.0    random
121 pseudo   10000  32    0.020349  0.020645  10000  10       9.7     317.0    random
169 pseudo   10000  32    0.028128  0.028538  10000  10       8.2     511.1    random
225 pseudo   10000  32    0.036318  0.037130  10000  10       6.9     769.6    random
128 median   10000  32    0.047244  0.047437  10000  10       7.0     802.4    random
289 pseudo   10000  32    0.044491  0.045037  10000  10       6.1     1105.9   random
361 pseudo   10000  32    0.052802  0.053114  10000  10       5.5     1524.7   random
256 median   10000  32    0.101282  0.101520  10000  10       5.0     1856.7   random
441 pseudo   10000  32    0.062605  0.063302  10000  10       5.0     2035.7   random
529 pseudo   10000  32    0.073239  0.073720  10000  10       4.5     2661.3   random
1681 pseudo  10000  32    0.277746  0.279358  10000  10       2.5     14514.7  random

Nice! I computed tables of failure probability for median combinations. They're shown as negative log base two of the probability, so 20 is roughly one in a million and so on. This gives some indication of the power of the combination. 5-of-5 is 22.89, slightly stronger than 3-of-3-of-3 at 20.86. And 7 is weaker than 3-of-3. The failure probability in the first line can be changed but it looks like these relationships stay stable.
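
The computation itself is small enough to include (standalone C, assuming the same 1-in-17 bad range per candidate as in the pseudomedian-of-9 estimate above; it reproduces the 22.89 and 20.86 figures):

#include <math.h>
#include <stdio.h>

// probability that a median of m candidates lands in the bad range when each
// candidate does so independently with probability x, i.e. the probability
// that at least (m + 1) / 2 candidates are bad
static double median_fail(int m, double x)
{
        double total = 0.0;

        for (int k = (m + 1) / 2 ; k <= m ; k++)
        {
                double c = 1.0;

                // build the binomial coefficient C(m, k)
                for (int i = 0 ; i < k ; i++)
                {
                        c = c * (m - i) / (i + 1);
                }
                total += c * pow(x, k) * pow(1.0 - x, m - k);
        }
        return total;
}

int main(void)
{
        double x = 1.0 / 17.0;          // per-candidate failure rate

        // negative log2 of the two-sided failure probability, as in the tables
        printf("7            %.2f\n", -log2(2.0 * median_fail(7, x)));
        printf("3-of-3       %.2f\n", -log2(2.0 * median_fail(3, median_fail(3, x))));
        printf("3-of-3-of-3  %.2f\n", -log2(2.0 * median_fail(3, median_fail(3, median_fail(3, x)))));
        printf("5-of-5       %.2f\n", -log2(2.0 * median_fail(5, median_fail(5, x))));

        return 0;
}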

With brute force testing I come out at 1 in 1333.33 for the pseudomedian of 9.

In my own tests I haven't been able to beat pseudomedian of 9 with the median of 3 for small partitions.

Name       Items  Best      Average   Loops    Samp  Median  Failure  Comps  Distribution
1 median   64     0.013947  0.013947  1000000  1     2500.3  125132   0.0    random
3 median   64     0.018938  0.018938  1000000  1     1874.2  22598    2.7    random
9 pseudo   64     0.038853  0.038853  1000000  1     1335.6  759      10.7   random
9 sloppy   64     0.063251  0.063251  1000000  1     1252.9  256      25.2   random
9 median   64     0.096236  0.096236  1000000  1     1228.6  187      22.6   random

9 median returns the center value; 9 sloppy also returns the center value, but it will be off by one in 66% of cases.

I don't get the concern about randomization breaking detection of bad patterns. The only thing that can slow down a branchless quicksort is bad pivot selection, right? But what qualifies as bad depends entirely on what pivot selection algorithm is used. PRNG test suites basically stress-test the generators against as many patterns as possible, so adding randomness in a suitable way should mean the only real-world data that has any chance of breaking them is malicious input.

For detecting patterns where fluxsort would be better, the variable-length selection seems ideal since you can do whatever statistics you want on the sample. My method just adds relatively small random offsets to evenly-spaced values, so they're still in order and approximately evenly spaced. I think this is more robust for pattern detection because with exact even spacing you might hit a part of the array that isn't representative (maybe if values come from a table with rows of 5 and div happens to be a multiple of 5).

flux_analyze could also look at larger structure. If pairs of elements in adjacent blocks of size 2^n are mostly ascending or mostly descending, then that should make the merge pass on those blocks run quickly. Searching for this would make it possible to detect when the data is basically ordered but elements are randomly a few positions off, since after the first few merge passes the rest will be easy. But it would need to be tested by sampling to avoid too much overhead, I'd expect.

I guess pta = array + rand() % div; would do the trick in median_of_sqrt. I'd still rather avoid it without known killer patterns.

I've looked into run detection in flux_analyze, but doing it without a ~3% performance loss on random is tricky. I'll probably give it another shot at some point.

I also wrote a dynamic pseudomedian selector. I get the impression though that on large partitions cache pollution outweighs the cpu advantage, so getting the median of 128 or 256 with fluxsort seems to perform better. I'll post the code in case you want to play around with it.

// median of `base` evenly spaced samples by pairwise comparison counting:
// t[x] counts the samples that swap[x] beats (ties go to the later index),
// and the sample that beats exactly base / 2 others is the median
VAR FUNC(flux_median)(VAR *array, size_t nmemb, size_t base, CMPFUNC *cmp)
{
        VAR swap[base];
        size_t div = nmemb / base;
        unsigned char x, y, mid = base / 2;
        unsigned char val, t[base];

        for (x = 0 ; x < base ; x++)
        {
                t[x] = 0;
                swap[x] = array[0]; array += div;
        }

        for (x = 0 ; x < base - 1 ; x++)
        {
                for (y = x + 1 ; y < base ; y++)
                {
                        val = cmp(swap + x, swap + y) > 0; t[x] += val; t[y] += !val;
                }
                if (t[x] == mid) return swap[x];
        }
        return swap[base - 1];
}

// pseudomedian of base^2: the median of `base` medians, one per chunk
VAR FUNC(flux_pseudomedian)(VAR *array, size_t nmemb, size_t base, CMPFUNC *cmp)
{
        VAR swap[base];
        size_t div = nmemb / base;
        unsigned char cnt;

        for (cnt = 0 ; cnt < base ; cnt++)
        {
                swap[cnt] = FUNC(flux_median)(array, div, base, cmp); array += div;
        }
        return FUNC(flux_median)(swap, base, base, cmp);
}

// quasimedian of base^3: the median of `base` pseudomedians, one per chunk
VAR FUNC(flux_quasimedian)(VAR *array, size_t nmemb, size_t base, CMPFUNC *cmp)
{
        VAR swap[base];
        size_t div = nmemb / base;
        unsigned char cnt;

        for (cnt = 0 ; cnt < base ; cnt++)
        {
                swap[cnt] = FUNC(flux_pseudomedian)(array, div, base, cmp); array += div;
        }
        return FUNC(flux_median)(swap, base, base, cmp);
}

Example: piv = FUNC(flux_quasimedian)(ptx, nmemb, 5, cmp); to get the quasimedian of 125.

Some small updates.

  1. Added crude offset randomization to FUNC(median_of_sqrt), not sure if finer control is beneficial.
  2. Added run detection to FUNC(flux_analyze); it should detect most pipe-organ distributions, and the performance loss is minor.
  3. Small change to FUNC(tail_swap) that seems to improve performance slightly.
  4. Added storing of the pivot as previously discussed. The first pivot is ignored rather than calculating the maximum element.

Still need to look at a few other things, like your suggestion to keep partitioning the leftover of a reverse partition.

Some more updates.

  1. Added branchless parity merges to the small array sorting routine of quadsort. Increases overall performance by about 15%.
  2. The reverse partitioner keeps sorting the leftover with flux instead of switching to quadsort.
  3. Fluxsort defaults to quadsort if memory allocation fails, and quadsort switches to in-place rotations if memory allocation fails once more.

I think that's about all I'll do with fluxsort for now.