phoboslab / qoi

The “Quite OK Image Format” for fast, lossless image compression

faster hash calculations and other optimizations.

kpckpc opened this issue

Edit: I made a mistake. There is a 0.03% chance of failing: an additional carry can corrupt the hash. This combination of colors apparently never occurred in my test set.

Not reporting an issue, but I wanted to drop some SWAR magic to calculate the hash faster. Please note this obeys the current hash function; it is not a proposal for a new hash function. It is also probably not something to include in the reference implementation, but it might be useful for people writing optimized versions.

#include <stdint.h>
#include <endian.h>
#if __BYTE_ORDER == __LITTLE_ENDIAN
  /* r is in the low byte of C.v; the subtraction corrects the blue-channel spillover */
  #define QOI_COLOR_HASH(C) \
    ((uint32_t) ((C.v & 0x3f3f3f3ful) * 0xc141c2cul - ((C.v & 0x200000ul) << 5)) >> 26)
#else
  /* byte order reversed, so the factors in the constant are reversed too */
  #define QOI_COLOR_HASH(C) \
    ((uint32_t) ((C.v & 0x3f3f3f3ful) * 0x2c1c140cul) >> 26)
#endif

The magic constant for little endian is (11|(7<<8)|(5<<16)|(3<<24))<<2; for big endian, the factors are reversed.
All byte products end up in the most significant byte, and the additional shift by 2 causes the two high bits to drop off (i.e. a free modulo 64).
For little endian, there is 1 bit of spillover from the blue channel that needs to be corrected, hence the subtraction.
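
To see where the rare failures from the edit above come from, here is a minimal standalone checker, assuming the qoi_rgba_t union layout from qoi.h and a little-endian machine. Both hashes depend only on the low 6 bits of each channel (each per-channel factor times 64 is 0 mod 64), so brute-forcing all 64^4 masked combinations is exhaustive:

#include <stdint.h>
#include <stdio.h>

typedef union {
  struct { unsigned char r, g, b, a; } rgba;
  uint32_t v;
} qoi_rgba_t;

/* reference hash from qoi.h; the % 64 normally happens at the call site */
#define HASH_REF(C) \
  ((C.rgba.r * 3 + C.rgba.g * 5 + C.rgba.b * 7 + C.rgba.a * 11) % 64)

/* SWAR variant from above (little-endian form); the >> 26 already includes the modulo 64 */
#define HASH_SWAR(C) \
  ((uint32_t) ((C.v & 0x3f3f3f3ful) * 0xc141c2cul - ((C.v & 0x200000ul) << 5)) >> 26)

int main(void) {
  uint64_t bad = 0;
  qoi_rgba_t c;
  for (int r = 0; r < 64; r++)
  for (int g = 0; g < 64; g++)
  for (int b = 0; b < 64; b++)
  for (int a = 0; a < 64; a++) {
    c.rgba.r = (unsigned char) r;
    c.rgba.g = (unsigned char) g;
    c.rgba.b = (unsigned char) b;
    c.rgba.a = (unsigned char) a;
    if (HASH_REF(c) != HASH_SWAR(c)) bad++;
  }
  /* the edit at the top reports roughly 0.03% of combinations failing */
  printf("%llu mismatches out of %llu\n",
    (unsigned long long) bad, (unsigned long long) (64ull * 64 * 64 * 64));
  return 0;
}

A nice side effect of the built-in modulo is that the % 64 at the index[] call sites becomes redundant with this macro.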

The generated assembly for the hash on x86_64 (note that the compiler first does the shift and then the masking):

        and     eax, 1061109567
        imul    eax, eax, 202644524
        shl     edi, 3            \
        and     edi, 1621156192    } 3 additional instructions to correct for spillover
        sub     eax, edi          /
        shr     eax, 26

I have tried all kinds of optimizations to speed it up, including SSE/AVX tricks to detect runs of RGB(A) opcodes and to perform hash calculations in parallel, but gcc -Ofast -march=native still outsmarts me when fed fairly clean C code with the following changes, which resulted in a roughly 180% speed increase:

  • split the decode loop into one loop for 3-channel and one for 4-channel pixels, to remove if-branching from the inner loop and to use register reads/writes where possible
  • use pointers instead of p and px_pos
  • remove the last else-if check (QOI_OP_RUN is the only option left, so a plain else is enough)
  • handle a run with a while(run) loop inside the QOI_OP_RUN branch, which causes gcc to generate very efficient vectorization (also for 3-channel data); see the sketch after this list
  • move the index update inside every branch except QOI_OP_INDEX and QOI_OP_RUN, to prevent redundant hash calculations
  • finally, reorder the ifs. This may depend on the dataset; for me, based on collected statistics, this order was the fastest: QOI_OP_LUMA, QOI_OP_INDEX, QOI_OP_DIFF, QOI_OP_RGBA, QOI_OP_RGB, QOI_OP_RUN.
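
As a rough illustration of the while(run) change, here is a hypothetical helper; the name emit_run and the surrounding decode structure are my assumptions, not the actual patch:

#include <stdint.h>
#include <string.h>

typedef union {
  struct { unsigned char r, g, b, a; } rgba;
  uint32_t v;
} qoi_rgba_t;

/* Emit `run` copies of px in one go, instead of decrementing a run counter
   once per outer-loop iteration; the tight loop lets gcc generate wide
   vectorized stores, for 3-channel output as well. Returns the advanced
   write pointer. */
static unsigned char *emit_run(unsigned char *dst, qoi_rgba_t px, int run, int channels) {
  while (run--) {
    memcpy(dst, &px.rgba, channels); /* copies 3 or 4 bytes per pixel */
    dst += channels;
  }
  return dst;
}

With the decode loop split per the first bullet, channels is a constant in each copy of the loop, so the memcpy compiles down to plain stores.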

As said, I just wanted to drop this. It is definitely not intended for inclusion in the reference version, which should remain readable.