scandum / fluxsort

A fast branchless stable quicksort / mergesort hybrid that is highly adaptive.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

fluxsort comparison behavior

Voultapher opened this issue · comments

Hi, I've recently been doing some research on sort implementations and noticed that in my tests that fluxsort is not stable. The test and benchmark suite is written in Rust, the following test fails for fluxsort but passes for other stable sorts such as Rust slice::sort and C++ std::stable_sort but not for fluxsort. You can find the test code here https://github.com/Voultapher/sort-research-rs/blob/2642b98dc60cb6c9a6dbdec706a71ebdfbdf89e1/tests/main.rs#L219. In essence it packs two i32 into one u64 and extracts in the comparison function the two i32 again and only uses the first one as comparison and then later checks that the order of the ascending second elements remains in order.

Also on a sidenote, I'm seeing rather different benchmark results, where fluxsort pretty much never beats pdqsort.

Fluxsort passes my own stability tests.

Do you have the cmp macro defined for primitive comparisons, or are you using a comparison function? Things can get a bit finicky since fluxsort uses the C interface for sorts by default.

Indeed I incorrectly supplied the comparison value.

Here is a reproducer:

#include <fluxsort.h>

#include <stdint.h>
#include <string.h>

// This is demo code and will only work on little-endian systems.

void print_i32_slice(int32_t *vals, size_t len)
{
    printf("[");
    for (size_t i = 0; i < len; i += 2)
    {
        printf("(%d, %d)", vals[i], vals[i + 1]);
    }
    printf("]\n");
}

int cmp_unpack_u64_to_i32(const void *a_ptr, const void *b_ptr)
{
    int32_t a = *(int32_t *)(a_ptr);
    int32_t b = *(int32_t *)(b_ptr);

    printf("comparing a: %d b: %d\n", a, b);

    // if (a < b)
    // {
    //     return -1;
    // }
    // else if (a > b)
    // {
    //     return 1;
    // }

    // return 0;

    if (a < b)
    {
        return -1;
    }

    return 1;
}

int main()
{
    uint64_t original[2] = {4294967296, 8589934592};
    uint64_t v[2] = {};
    memcpy(v, original, sizeof(original));

    printf("original: ");
    print_i32_slice((int32_t *)original, 4);

    fluxsort((void *)v, 2, sizeof(uint64_t), cmp_unpack_u64_to_i32);

    printf("sorted: ");
    print_i32_slice((int32_t *)v, 4);

    return 0;
}

Rust internally and C++ externally use the is_less contract where a function returns a bool. I had incorrectly mapped this concept into the glue code, that uses a Rust closure as comparison function. When I supply the full values it passes the test.

Arguably this could be avoided by limiting comparisons to is_less ie. x = cmp(pta + 1, pta) < 0; instead of x = cmp(pta, pta + 1) > 0;. This ties into the theme of how to specify the comparison function. In your benchmarks you use a - b which can invoke UB, which for example manifests itself when sorting int32_t v[2] = {-2147483644, 5}; for example by yielding an incorrect sort result. Fundamentally this is the shortcoming of the qsort interface, by limiting yourself internally to is_less one could do:

if (a < b)
{
    return -1;
}

return 1;

And still retain the stability guarantee, while losing some performance but not the 50+% extra of using the fully correct comparison function. Or bite the bullet and change the comparison interface to use is_less instead of an int.

Internally my sorts only require the (a > b) comparison, which is the right way to go about it as that translates to both greater than and not stable, as well as being compatible with the qsort interface.

C++ messed up by not being compatible with C in this regard, and less than doesn't translate to stable since that would be (a <= b). So it should be c++ and rust biting the bullet, but I doubt that's going to happen.

a - b indeed has an issue with large numbers, I'm forced to use it however in the benchmark since it calls qsort, and qsort doesn't perform proper stability checks and uses a mix of <, <=, and >.

I'm not sure this is the right place to talk about standard library API design, to clarify Rust uses Ord as their interface and is_less is an internal detail.

Indeed you seem to have already done what I suggested but biased towards is_more instead of is_less.

a - b indeed has an issue with large numbers, I'm forced to use it however in the benchmark since it calls qsort, and qsort doesn't perform proper stability checks and uses a mix of <, <=, and >.

So what you are saying is, look at my stable algorithm and how it performs against this unstable sort, with a broken compare function? There is saying that writing fast code is easy, and writing correct code is hard. Looking at the documentation of qsort https://en.cppreference.com/w/c/algorithm/qsort it tells you to do:

int compare_ints(const void* a, const void* b)
{
    int arg1 = *(const int*)a;
    int arg2 = *(const int*)b;
 
    if (arg1 < arg2) return -1;
    if (arg1 > arg2) return 1;
    return 0;
 
    // return (arg1 > arg2) - (arg1 < arg2); // possible shortcut
    // return arg1 - arg2; // erroneous shortcut (fails if INT_MIN is present)
}

Imo if you want a fair comparison you ought to use the correct comparison function for both qsort and fluxsort, and if you want the fastest absolute performance and comparison to languages with faster comparison function abstractions like C++ and Rust changing the signature seem inevitable.

Not quite, in the benchmark against pdqsort the #define cmp(a,b) (*(a) > *(b)) macro is enabled for fluxsort and no comparison functions are involved.

In the case of qsort it doesn't particularly matter for the benchmark I'm running, but admittedly arg1 - arg2 is slightly favorable for fluxsort.

Ah good to know, I'll look into benchmarking with that.

I just came across your presentation and wanted to leave a few remarks for you, figured doing so here might be the best place.

Fluxsort does work on non-integers, but what I'm doing (probably in one of the more atrocious C programming practices) is treating pointers as either 32 or 64 bit integers. The code works out of the box for 32 and 64 bit systems if sizeof(char *) is used for C strings when calling fluxsort(). In the case of 64 bit pointers the pointers are copied as long long, which isn't an issue in C, and each long long is passed as a void type to the comparison function, which can correctly cast it to the pointer it actually is. The benchmark itself performs a test using C strings.

As for stability, some kind of definite proof would be useful. One possible issue when porting is that unguarded insert requires the first 2 elements to be sorted, which the tail_swap() function that calls it takes care of.

Testing for stability is pretty rigorous in the benchmark using a dedicated method for integers, and since identical strings have different pointers, it's also detected during the string test.

Anyways, a cool presentation, and good luck with the porting effort.

Thanks for feedback and kind words :)

You are correct that it works for some user types, however because it allocates memory with malloc void* it would underalign user types with larger than pointer alignment, which would be UB. So it's support for user defined types is limited. Plus there are two other properties it doesn't have that make a port without modifications unsuitable as general purpose sort in Rust. For one, a user may see diverging observations of what value was passed into the comparison function. If in a horribly malicious but legal way the user was sorting a Mutex<Option<Box<...>>> a Mutex holding either a null-pointer or a valid pointer, and would swap the content during a comparison from owned pointer to null-pointer freeing the allocation, when the user drops the slice after sorting it would free the allocation a second time -> double free UB. The other one would be a panic/exception during comparison, leaving the slice with potential duplicate elements, which again could lead to double free UB. Of course C has neither scope based destructors (RAII) nor unwinding, so for C it's perfectly fine for user provided types as long as their type has fundamental alignment.

I don't think I use malloc void* in fluxsort. The general idea in C would be to create a custom structure in fluxsort.h for each custom size needed. It is a bit obnoxious, but writing typeless code in C is even more so.

@Voultapher

I published a simplified version of quadsort with decent performance that might be useful to your porting effort. It's around 150 lines of code. It should be some orders of magnitude easier to create a formal proof for it as well.

https://github.com/scandum/piposort

@scandum unless I misunderstand the code or the relevant C rules,

VAR *swap = malloc(nmemb * sizeof(VAR));
will be under-aligned for types with alignment larger than the fundamental alignment. Even if the type switch in the initial function is adjusted.

Regarding piposort, thanks for the suggestion. I'll certainly take a look. Note, that I've largely abandoned the porting effort and am mostly focused on improving the 'components' of the existing implementations.

Not sure if we're talking about the same thing. In fluxsort.h you'd define:

typedef struct {char bytes[20];} struct160;
#define VAR struct160
#define FUNC(NAME) NAME##160
#define STRUCT(NAME) struct NAME##160

#include "fluxsort.c"

That should allow sorting any 20 byte custom structure. Would also need to add a case 20 in void fluxsort()

I'm talking about types with custom alignment:

#include <stdalign.h>
#include <stdlib.h>

struct {
    alignas(64) char bytes[8];
} CustomType;

void* x(size_t count) {
    // wrong
    // return malloc(sizeof(CustomType) * count);

    // correct
    return aligned_alloc(alignof(CustomType), sizeof(CustomType) * count);
}

Here the malloc call as done by fluxsort would under-align CustomType.

Thanks, that's not something I've had to deal with often. I'll look into it.

As for under alignment, it shouldn't make much of a performance difference since we're not dealing with random access. People could however use their own aligned alloc and call fluxsort_swap() directly.

I'm hesitant to do anything a-typical, since people tend to make strange and inaccurate claims already. I assume it's the combination of a tempting fast algorithm, and the annoyance of trying to understand the code.

I think it's fine to leave it as is, while adding a note to the documentation. My understanding is, that for certain types such as SIMD registers it's UB to allocate and then use them if they are under-aligned.

That's a good point. I'll do some actual testing (since I can intentionally misalign the data), and at the very least suggest its usage in a code comment.