scandum / fluxsort

A fast branchless stable quicksort / mergesort hybrid that is highly adaptive.

Collaborating?

mlochbaum opened this issue

First off, sorry again to have jumped to conclusions on Hacker News. As a programming language guy I'm in the market for a stable sort and Fluxsort seems to have the fundamentals down!

I wanted to know if you're interested in incorporating some or all of what I do into this repository. I'd like to help improve Fluxsort's worst cases and pattern recognition with techniques like pdqsort uses, and expand the interface to support other operations. Some possible topics, which I can also split into individual issues:

  • Add more sophisticated benchmarks with subtle patterns
  • Candidate randomization to avoid poor pivot selection for patterned as well as random data
  • Move the call stack into the memory buffer to avoid stack overflow and improve speed
  • Recognize signs of mostly-sorted inputs during pivot selection
  • Specialized integer code; maybe SIMD instructions, non-stable optimizations, or hybridization with counting sort

(pdqsort depends on instability a lot, but I think there are usually ways to avoid this)

As evidence that I have some clue what I'm talking about, here's a small improvement: the following inner loop for flux_partition gives about a 3% speedup in my benchmarks on random 1e6-element 4-byte int arrays.

size_t n = 0, val;

// write each element to both destinations, then advance only the side that keeps it
val = cmp(ptx + 0, &piv) <= 0; pta[n] = ptx[0]; pts[0 - n] = ptx[0]; n += val;
// etc. for ptx[1] through ptx[7]

pta += n; pts += 8 - n;
ptx += 8;

  1. More sophisticated benchmarks are probably needed.
  2. Adding random pivots might prevent a beneficial switch to quadsort.
  3. I haven't previously been able to improve performance with a custom call stack, so I'm reluctant to clutter the code if there's no performance gain.
  4. Detecting ordered distributions during median selection could be useful.
  5. I updated https://github.com/scandum/wolfsort yesterday to default to fluxsort; pretty good gain on 32-bit integers.

Some elegant way to sort random generic data (like 16 distinct values) would be useful; pdqsort does this rather well, though I'm not sure if there's an added cost for other distributions.

That's a clever improvement, I'll do some testing and see about incorporating it.

I was able to adapt pdqsort's method for handling many equal values, which brings the performance on the "generic" case roughly in line with pdqsort. Pushed to https://github.com/mlochbaum/fluxsort . There doesn't seem to be any significant performance impact in other tested cases, but I found that r_array[cnt] = (unsigned)(cnt*cnt) % 10000 gets much slower, so I need to figure out why that is.

The strategy is the same but reversed left to right to account for the different recursion order. So, before partitioning it checks to see if the pivot is equal to the element after the current slice. The only extra step is to copy that element-after to the swap space before recursing into it. It might seem like the fact that pdqsort can put the pivot in the middle right after partitioning and fluxsort can't would make a difference, but it doesn't because the right partition is sorted (so it starts with the pivot) before visiting the left.
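
In sketch form (not the actual flux_partition code; the names here are mine), the check before partitioning looks something like this:

// `after` points to the element just past the current slice, which is already
// in its final sorted position, or is NULL when the slice runs to the end of
// the array.  Since *after is >= everything in the slice, piv == *after means
// that every element not strictly below piv is equal to it.
if (after != NULL && cmp(&piv, after) >= 0)
{
        // reverse partition (hypothetical helper): group the elements equal
        // to piv at the end of the slice, where they need no further sorting,
        // and recurse only into the "< piv" part, whose size is returned
        nmemb = FUNC(flux_reverse_partition)(array, swap, &piv, nmemb, cmp);
}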

How much do you care about code style here? I've copy-pasted the partitioning code to match the existing style, but I'd find a version that uses macros for the loop unrolling and two copies easier to read.

As far as I can tell the flux_max function isn't stable?

Moving the maximum element to the end might be possible to accomplish in n comparisons and ~ log n swaps on randomized data. I'm having a hard time wrapping my head around it.

As for code style, I can be a bit particular, but I'm open to suggestions.

On my end, I've experimented with a pseudomedian of 25 that detects sorted partitions. It improves sorting of generic data slightly, but it's barely worth the trouble.

On second thought, I think the bubble sort would be the right way to go. n comps and n swaps.

No, I guess it isn't. I'd been thinking about ensuring that the maximum element doesn't get swapped with anything less than it, but the element at the end winds up out of place. It's not really necessary, because you can track whether the current slice comes at the end of the array or not; it's just that you have to add cases for when it's used. But this condition starts out false and becomes true after the first base case, so it's not that complicated.

Finding the maximum is really fast with SIMD instructions (on appropriate types of course). Anything more complicated is probably too much overhead.

I reckon there will be a variety of solutions. I'll try to wrap my head around this partitioning logic in the next few days and get back to you.

Remark on pseudomedians: the order of the candidates does matter, and I think both pdqsort and flux are too systematic. I haven't analyzed the 15-candidate version, but the issue with 3-by-3 is that when combining values, two that are on the same side always outweigh a third. If values in the middle are higher and those at the edges are lower, then Fluxsort ends up with two low groups and one high group. It will then pick the median from one of the low groups even though the true median is probably the highest value from a low group. pdqsort spreads out the groups but gets the same problem in the first round of medians, so that every group will end up with a low median. A scheme like 125/048/367 should be better. I also wonder if it might be preferable to just take the true median of 5 or 7 (I don't know how hard that is) instead. Fluxsort has a lot of overhead coming from median calculations, and it seems to me like it should be possible to reduce it.
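
To make 125/048/367 concrete, here's a sketch, assuming a median_of_three that takes sample indices and returns the index of its median:

size_t v0, v1, v2, div = nmemb / 9;

// groups 1-2-5, 0-4-8 and 3-6-7: no group is three adjacent samples, and
// each group contains one of the middle candidates 3, 4, 5
v0 = FUNC(median_of_three)(array, div * 1, div * 2, div * 5, cmp);
v1 = FUNC(median_of_three)(array, div * 0, div * 4, div * 8, cmp);
v2 = FUNC(median_of_three)(array, div * 3, div * 6, div * 7, cmp);

piv = array[FUNC(median_of_three)(array, v0, v1, v2, cmp)];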

3x3 is as good at avoiding the worst case partition as 1x7 with fewer comparisons.

1x7 requires 6+5+4+3+2+1 = 21 comparisons worst case while 3x3 is at 3+3+3+3 = 12.

I guess a good baseline is to try to beat std::stable_sort for any distribution. This might be possible by adding run detection to the analyzer, since there's a tipping point where branch prediction outperforms branchless.

I was just wondering if it makes sense to search for an initial run at the beginning of flux_partition every time. Best case is a complete run of course. If it's a very large run you probably just switch to Quadsort right away. If there's nothing then you quit after 3-5 elements probably, and if there's a small run then at least you can partition it quickly with a binary search and the work isn't wasted. My worry is that I don't know how common perfect runs are relative to imperfect ones, especially after partitioning (but at least flux isn't destroying runs like unstable methods do). Timsort prioritizes perfect runs heavily, but I don't think it would be obvious if this was a bad idea all along.

SIMD is also relevant, since a run search can easily check multiple elements at a time. But it doesn't work in plain C because the compiler's not allowed to access elements after the stopping point.
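
As a rough sketch of the search itself (not fluxsort code, and the cutoffs are arbitrary):

// look for an ascending run at the front of the slice before partitioning
size_t run = 1;

while (run < nmemb && cmp(array + run - 1, array + run) <= 0)
{
        run++;
}

if (run == nmemb)
{
        return;                 // the whole slice is already sorted
}
else if (run > nmemb / 2)
{
        // very large run: hand the slice to quadsort, which merges runs cheaply
}
else if (run >= 4)
{
        // small run: binary search the pivot's position in the run so the
        // comparisons spent detecting it aren't wasted
}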

It's possible to perform a branchless run analysis at a ~3% performance cost for random, useful for detecting pipe organ distributions.

I'm getting good performance with a branchless bubble sort. I typically use ./a.out 10000 100 1 1 for accurate readings, though I had to drop down to ./a.out 1000 100 10 1 with this code. Ought to be stable.

// one branchless bubble pass: moves the maximum element to the end while
// keeping equal elements in their original order
void FUNC(flux_max)(VAR *array, size_t nmemb, CMPFUNC *cmp)
{
        VAR swap, *ptb, *pta = array;

        while (--nmemb)
        {
                // ptb is pta + 1 when the pair is out of order, otherwise pta itself
                ptb = &pta[cmp(pta, pta + 1) > 0];

                swap = *pta; *pta++ = *ptb; *ptb = swap;
        }
}

Edit: Previous code was going out of bounds, probably why the readings were odd. Not ideal performance, ~3% slower.

I took a closer look at the actual partitioning logic, and I think the analyzer would take care of most cases where flux_max would be beneficial? Another concern is that it takes an extra n comparisons, which isn't a big deal, but it might be to some people.

As far as I can tell most of the performance gain can be obtained from the if (s_size == 0) check. I'm leaning towards a slightly different approach, I'll attach what I currently have.

One of the things I'm trying to do is to minimize stack allocation during recursion. In my current setup that does result in some unnecessary pivot calls, but it seems to slightly improve performance on random without a notable impact on generic.

fluxsort.txt

This seems to work to deal with the cnt * cnt % 10000 distribution for large arrays. Perhaps a bit heavy-handed. It's still not quite on par with pdqsort for 10M+ but it gets pretty close.

Should be possible to keep track of the previous pivot with a maximum_of_twentyfive routine and pass it along recursively; I'm looking into that next.

void FUNC(fluxsort)(void *array, size_t nmemb, CMPFUNC *cmp);

// sort 128 evenly spaced samples with fluxsort itself and return the middle one
VAR FUNC(median_of_hundredtwentyeight)(VAR *array, size_t nmemb, CMPFUNC *cmp)
{
        VAR *pta, swap[128];
        size_t cnt, div = nmemb / 128;

        pta = array;

        for (cnt = 0 ; cnt < 128 ; cnt++)
        {
                swap[cnt] = pta[0];

                pta += div;
        }
        FUNC(fluxsort)(swap, 128, cmp);

        return swap[64];
}

// pseudomedian of 25: the median of five medians of five
VAR FUNC(median_of_twentyfive)(VAR *array, size_t nmemb, CMPFUNC *cmp)
{
        size_t v0, v1, v2, v3, v4, div = nmemb / 64;

        v0 = FUNC(median_of_five)(array, div *  4, div *  1, div *  2, div *  8, div * 10, cmp);
        v1 = FUNC(median_of_five)(array, div * 16, div * 12, div * 14, div * 18, div * 20, cmp);
        v2 = FUNC(median_of_five)(array, div * 32, div * 24, div * 30, div * 34, div * 38, cmp);
        v3 = FUNC(median_of_five)(array, div * 48, div * 42, div * 44, div * 50, div * 52, cmp);
        v4 = FUNC(median_of_five)(array, div * 60, div * 54, div * 56, div * 62, div * 63, cmp);

        return array[FUNC(median_of_five)(array, v2, v0, v1, v3, v4, cmp)];
}

                // pivot selection by partition size
                if (nmemb > 1024)
                {
                        if (nmemb > 65536)
                        {
                                piv = FUNC(median_of_hundredtwentyeight)(ptx, nmemb, cmp);
                        }
                        else
                        {
                                piv = FUNC(median_of_twentyfive)(ptx, nmemb, cmp);
                        }
                }
                else
                {
                        piv = FUNC(median_of_nine)(ptx, nmemb, cmp);
                }

I think that strategy is entirely reasonable, actually, and it doesn't even add much to the code size. It should be able to do the median in already-allocated memory (in extreme cases swap won't have enough space but array will). And the number of candidates can vary. I'll see if I can do some statistics and figure out how the number of candidates sorted relates to the ratio of sizes.

I still think switching to quadsort on a single bad partition isn't really viable. It should switch when the partition size is too large relative to the depth, for a cumulative measure. Statistics also needed here.
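
Something along these lines, with made-up names and an arbitrary factor of 8:

// an ideal split leaves about total >> depth elements after `depth` levels,
// so only a partition far larger than that counts as bad enough to give up
// on quicksort for this slice; a single unlucky pivot won't trigger it
int FUNC(flux_depth_exceeded)(size_t nmemb, size_t total, size_t depth)
{
        while (depth--)
        {
                total /= 2;
        }
        return nmemb / 8 > total;
}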

From this answer, medians follow a beta distribution: the probability that a median of m is found p of the way along the actual distribution is proportional to p^(m/2) * (1-p)^(m/2). And assuming the cost to sort is n log(n), the cost after partitioning is n log(n) plus n * (p log(p) + (1-p) log(1-p)). So given m, it's easy to numerically compute the coefficient k so that the expected cost is n*k. Then the value n where that cost is effective is m log(m) / k. But that relies on the cost of sorting having a constant coefficient; it doesn't, so this is an underestimate.
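
Written out, with p the quantile where the median of m candidates lands (just restating the two relations above):

\[
f_m(p) \propto p^{m/2} (1-p)^{m/2}, \qquad
C(n, p) \approx n \log n + n \big( p \log p + (1-p) \log(1-p) \big).
\]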

I computed this table of values, where the first column is n and the second is m. It tracks √n pretty closely, and the third column shows sqrt(n/(1+log_4(n))), which is a very close estimate. But the value m is an overestimate, and using a median half as large (corresponding to small-array sorting being four times as expensive) probably makes more sense. This suggests it's best to switch from median-of-15 at about 8000 elements. At this point the cost of computing the median size with count-leading-zeros and square root hardly matters.

If you're saving the sorted candidates between iterations then the math works out differently. You could either select more candidates at the start, or merge in candidates to get to the right amount at a lower cost (increasing the size by a factor of √2 if √n is still accurate).

That lines up pretty closely with my own pivot choices based on benchmarks. I might be able to write a dynamic pivot selection routine.

I ran some calculations and it seems like the cubic root of the partition size might be a good value for the pseudomedian.

I'm pretty happy with some of the changes I made so I'm thinking of pushing out a new version. How would you like me to credit you for your contributions? I've incorporated the loop unrolling and s_size == 0 optimization.

My branch uses the sqrt(n/(1+log_4(n))) / 2 value (but it computes the gap size, which works out to sqrt(2*n*(2+log_2(n)))). Decreasing the number of pivots definitely increases the time on shuffled data, so I think this is better than a cube root version.
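
For reference, the whole computation is tiny (a sketch, assuming gcc/clang's __builtin_clzll for the integer log):

#include <math.h>

// gap between evenly spaced pivot candidates: with about
// sqrt(n / (1 + log4(n))) / 2 candidates the gap works out to
// sqrt(2 * n * (2 + log2(n)))
static size_t candidate_gap(size_t nmemb)
{
        size_t log2n = 63 - __builtin_clzll(nmemb);     // integer log2 via count-leading-zeros

        return (size_t) sqrt(2.0 * (double) nmemb * (double) (2 + log2n));
}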

I also added a pseudorandom value to each pivot candidate. Running a full PRNG step for each pivot adds too much overhead, so I split Marsaglia's Xorshift into three steps and do only one for each candidate. The adjacent candidates won't be independent, but the overall selection will end up with plenty of randomness. I'm thinking about how to apply something similar to the smaller pivot computations as well. As far as I can tell the only reason not to randomize candidates is performance: the goal is to avoid patterns in pivot selection, and PRNGs are designed specifically for this purpose.
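
Roughly like this (a sketch with an arbitrary seed and made-up names; the shifts are the usual xorshift32 triple):

// one shift-xor per candidate instead of a full PRNG step; a complete
// xorshift round is spread across three consecutive candidates, so adjacent
// offsets are correlated but the selection as a whole still gets scrambled
static unsigned int flux_rand = 0x3b9a73c5;

static size_t candidate_offset(size_t div, unsigned int step)
{
        switch (step % 3)
        {
                case 0: flux_rand ^= flux_rand << 13; break;
                case 1: flux_rand ^= flux_rand >> 17; break;
                case 2: flux_rand ^= flux_rand <<  5; break;
        }
        return flux_rand % div;
}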

I'm not really concerned about the attribution (there's a record of what happened here, anyway). If you've copied any actual code instead of just ideas then you could add me to the license text.

I've also been looking at insertion sort. Replacing quadsort with insertion sort directly is slightly slower, but removing all the quadsort calls and running unguarded insertion sort on the entire array at the end is the same speed for random data or even slightly faster, and 2-3% faster after increasing FLUX_OUT to 32. It still needs a fallback in the case of bad pivot selection. If that exists, then at the end all partitions are either sorted or smaller than FLUX_OUT, and it's enough to run a guarded insertion sort on the first FLUX_OUT entries and an unguarded insertion sort on the rest. For testing I didn't do that—I didn't guard anything and it happened to work. I'll see if I can get a correct version.
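
Roughly what I mean, as a sketch (the function name is made up; FLUX_OUT is the existing small-partition threshold):

// guarded insertion sort over the first FLUX_OUT elements, unguarded over the
// rest: once all partitions are either sorted or smaller than FLUX_OUT, every
// later element has an element no larger than itself within FLUX_OUT positions
// to its left, so the inner loop stops without checking the array bound
void FUNC(flux_insert_finish)(VAR *array, size_t nmemb, CMPFUNC *cmp)
{
        VAR key, *pta, *ptb, *end = array + nmemb;
        VAR *guard = nmemb > FLUX_OUT ? array + FLUX_OUT : end;

        for (pta = array + 1 ; pta < guard ; pta++)
        {
                key = *pta;

                for (ptb = pta ; ptb > array && cmp(ptb - 1, &key) > 0 ; ptb--)
                {
                        *ptb = ptb[-1];
                }
                *ptb = key;
        }

        for ( ; pta < end ; pta++)
        {
                key = *pta;

                for (ptb = pta ; cmp(ptb - 1, &key) > 0 ; ptb--)
                {
                        *ptb = ptb[-1];
                }
                *ptb = key;
        }
}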

Increasing FLUX_OUT to 32 increases the comparison count by quite a bit, I'm trying to keep performance balanced for both integers and strings.

I updated the source code with the discussed changes. I think I'll create a separate repository that purely focuses on benchmarking pivot selection.

I'm also very interested in pivot selection. I'd never have guessed that stopping at 9 (or 15 even!) candidates is measurably worse. I just added randomization to the 15-candidate version, based on this generator because I only use a few low bits and the smaller numbers scramble these faster. This randomization makes median_of_fifteen itself about 50% slower, but that comes out to around 0.1% because it's only used on large partitions.

Too much overhead to do this for 9 candidates. Maybe using a fixed random sequence is best here. The main thing is to avoid doing too badly on exactly periodic data, because that kind of thing will definitely show up. It's also worth looking at a 3-candidate version for the smallest partitions (64? 128?), since it means there's less pressure on the 9-candidate one to be as fast as possible.

The 1 in 65,536 chance you give for pseudomedian-of-9 failing is too optimistic. True, it requires 4 candidates to be in the bad range, but there are many combinations of candidates that will cause the failure so the probability is higher. The actual probability that the larger partition fails is 2 * f(f(1/17)), where f(x) = x^3 + 3*(1-x)*x^2 is the probability a median of 3 fails if each of the three has failure rate x. I multiplied by 2 since it can fail on either side, and the end result is always that most of the array is sorted by fallback. That's about 1 in 1687 (confirmed with a simulation), which is not that low considering it happens at every partition.

Median of 5 groups of 3 gives 1 in 51161, which is probably fine, and slightly better than 3 of 5.

I'm a bit hesitant about randomization since it might make it difficult to pinpoint bad patterns.

I'll run some tests to see for myself if the failure rate is 1 in 1687 for the pseudomedian of 9; I've wondered if I was too optimistic thinking it would be 0.0625 ^ 4.

I've switched to the pseudomedian of 25 for partitions above 1024 which seems to do fairly well.

I'm doing some testing with larger pseudomedians and what I've dubbed quasimedians (3x3x3, 5x5x5, etc.) as well.

Name         Items  Type  Best      Average   Loops  Samples  Median  Comps    Distribution
1 median     10000  32    0.000079  0.000104  10000  10       49.9    0.0      random
3 median     10000  32    0.000317  0.000374  10000  10       37.5    2.7      random
9 pseudo     10000  32    0.001491  0.001523  10000  10       26.5    10.7     random
11 median    10000  32    0.002646  0.002720  10000  10       22.6    34.0     random
15 pseudo    10000  32    0.002788  0.002869  10000  10       21.8    21.3     random
27 quasi     10000  32    0.004295  0.004416  10000  10       18.9    34.6     random
25 pseudo    10000  32    0.005524  0.005639  10000  10       17.6    48.1     random
49 pseudo    10000  32    0.009105  0.009657  10000  10       15.6    89.1     random
81 pseudo    10000  32    0.014094  0.014208  10000  10       12.0    180.4    random
125 quasi    10000  32    0.021470  0.021789  10000  10       9.6     248.0    random
121 pseudo   10000  32    0.020349  0.020645  10000  10       9.7     317.0    random
169 pseudo   10000  32    0.028128  0.028538  10000  10       8.2     511.1    random
225 pseudo   10000  32    0.036318  0.037130  10000  10       6.9     769.6    random
128 median   10000  32    0.047244  0.047437  10000  10       7.0     802.4    random
289 pseudo   10000  32    0.044491  0.045037  10000  10       6.1     1105.9   random
361 pseudo   10000  32    0.052802  0.053114  10000  10       5.5     1524.7   random
256 median   10000  32    0.101282  0.101520  10000  10       5.0     1856.7   random
441 pseudo   10000  32    0.062605  0.063302  10000  10       5.0     2035.7   random
529 pseudo   10000  32    0.073239  0.073720  10000  10       4.5     2661.3   random
1681 pseudo  10000  32    0.277746  0.279358  10000  10       2.5     14514.7  random

Nice! I computed tables of failure probability for median combinations. They're shown as negative log base two of the probability, so 20 is roughly one in a million and so on. This gives some indication of the power of the combination. 5-of-5 is 22.89, slightly stronger than 3-of-3-of-3 at 20.86. And 7 is weaker than 3-of-3. The failure probability in the first line can be changed but it looks like these relationships stay stable.
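
The computation itself is small enough to include (standalone C, assuming the same 1-in-17 bad range per candidate as in the pseudomedian-of-9 estimate above; it reproduces the 22.89 and 20.86 figures):

#include <math.h>
#include <stdio.h>

// probability that a median of m candidates lands in the bad range when each
// candidate does so independently with probability x, i.e. the probability
// that at least (m + 1) / 2 candidates are bad
static double median_fail(int m, double x)
{
        double total = 0.0;

        for (int k = (m + 1) / 2 ; k <= m ; k++)
        {
                double c = 1.0;

                // build the binomial coefficient C(m, k)
                for (int i = 0 ; i < k ; i++)
                {
                        c = c * (m - i) / (i + 1);
                }
                total += c * pow(x, k) * pow(1.0 - x, m - k);
        }
        return total;
}

int main(void)
{
        double x = 1.0 / 17.0;          // per-candidate failure rate

        // negative log2 of the two-sided failure probability, as in the tables
        printf("7            %.2f\n", -log2(2.0 * median_fail(7, x)));
        printf("3-of-3       %.2f\n", -log2(2.0 * median_fail(3, median_fail(3, x))));
        printf("3-of-3-of-3  %.2f\n", -log2(2.0 * median_fail(3, median_fail(3, median_fail(3, x)))));
        printf("5-of-5       %.2f\n", -log2(2.0 * median_fail(5, median_fail(5, x))));

        return 0;
}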

With brute force testing I come out at 1 in 1333.33 for the pseudomedian of 9.

In my own tests I haven't been able to beat pseudomedian of 9 with the median of 3 for small partitions.

Name       Items  Best      Average   Loops    Samp  Median  Failure  Comps  Distribution
1 median   64     0.013947  0.013947  1000000  1     2500.3  125132   0.0    random
3 median   64     0.018938  0.018938  1000000  1     1874.2  22598    2.7    random
9 pseudo   64     0.038853  0.038853  1000000  1     1335.6  759      10.7   random
9 sloppy   64     0.063251  0.063251  1000000  1     1252.9  256      25.2   random
9 median   64     0.096236  0.096236  1000000  1     1228.6  187      22.6   random

9 median returns the center value; 9 sloppy also returns the center value, but it will be off by one in 66% of cases.

I don't get the concern about randomization breaking detection of bad patterns. The only thing that can slow down a branchless quicksort is bad pivot selection, right? But what qualifies as bad depends entirely on what pivot selection algorithm is used. PRNG test suites basically stress-test the generators against as many patterns as possible, so adding randomness in a suitable way should mean the only real-world data that has any chance of breaking them is malicious input.

For detecting patterns where fluxsort would be better, the variable-length selection seems ideal since you can do whatever statistics you want on the sample. My method just adds relatively small random offsets to evenly-spaced values, so they're still in order and approximately evenly spaced. I think this is more robust for pattern detection because with exact even spacing you might hit a part of the array that isn't representative (maybe if values come from a table with rows of 5 and div happens to be a multiple of 5).

flux_analyze could also look at larger structure. If pairs of elements in adjacent blocks of size 2^n are mostly ascending or mostly descending, then that should make the merge pass on those blocks run quickly. Searching for this would make it possible to detect when the data is basically ordered but elements are randomly a few positions off, since after the first few merge passes the rest will be easy. But it would need to be tested by sampling to avoid too much overhead, I'd expect.

I guess pta = array + rand() % div; would do the trick in median_of_sqrt. I'd still rather avoid it without known killer patterns.

I've looked into run detection in flux_analyze, but doing it without a ~3% performance loss on random is tricky. I'll probably give it another shot at some point.

I also wrote a dynamic pseudomedian selector. I get the impression though that on large partitions cache pollution outweighs the cpu advantage, so getting the median of 128 or 256 with fluxsort seems to perform better. I'll post the code in case you want to play around with it.

// median of `base` evenly spaced samples by pairwise comparison counting:
// t[x] counts the samples that swap[x] beats (ties go to the later index),
// and the sample that beats exactly base / 2 others is the median
VAR FUNC(flux_median)(VAR *array, size_t nmemb, size_t base, CMPFUNC *cmp)
{
        VAR swap[base];
        size_t div = nmemb / base;
        unsigned char x, y, mid = base / 2;
        unsigned char val, t[base];

        for (x = 0 ; x < base ; x++)
        {
                t[x] = 0;
                swap[x] = array[0]; array += div;
        }

        for (x = 0 ; x < base - 1 ; x++)
        {
                for (y = x + 1 ; y < base ; y++)
                {
                        val = cmp(swap + x, swap + y) > 0; t[x] += val; t[y] += !val;
                }
                if (t[x] == mid) return swap[x];
        }
        return swap[base - 1];
}

// pseudomedian of base^2: the median of `base` medians, one per chunk
VAR FUNC(flux_pseudomedian)(VAR *array, size_t nmemb, size_t base, CMPFUNC *cmp)
{
        VAR swap[base];
        size_t div = nmemb / base;
        unsigned char cnt;

        for (cnt = 0 ; cnt < base ; cnt++)
        {
                swap[cnt] = FUNC(flux_median)(array, div, base, cmp); array += div;
        }
        return FUNC(flux_median)(swap, base, base, cmp);
}

// quasimedian of base^3: the median of `base` pseudomedians, one per chunk
VAR FUNC(flux_quasimedian)(VAR *array, size_t nmemb, size_t base, CMPFUNC *cmp)
{
        VAR swap[base];
        size_t div = nmemb / base;
        unsigned char cnt;

        for (cnt = 0 ; cnt < base ; cnt++)
        {
                swap[cnt] = FUNC(flux_pseudomedian)(array, div, base, cmp); array += div;
        }
        return FUNC(flux_median)(swap, base, base, cmp);
}

Example: piv = FUNC(flux_quasimedian)(ptx, nmemb, 5, cmp); to get the quasimedian of 125.

Some small updates.

  1. Added crude offset randomization to FUNC(median_of_sqrt), not sure if finer control is beneficial.
  2. Added run detection to FUNC(flux_analyze); it should detect most pipe-organ distributions, and the performance loss is minor.
  3. Small change to FUNC(tail_swap) that seems to improve performance slightly.
  4. Added storing of the pivot as previously discussed. The first pivot is ignored rather than calculating the maximum element.

Still need to look at a few other things, like your suggestion to keep partitioning the leftover of a reverse partition.

Some more updates.

  1. Added branchless parity merges to the small array sorting routine of quadsort. Increases overall performance by about 15%.
  2. The reverse partitioner keeps sorting the leftover with flux instead of switching to quadsort.
  3. Fluxsort defaults to quadsort if memory allocation fails, and quadsort switches to in-place rotations if memory allocation fails once more.

I think that's about all I'll do with fluxsort for now.