lemire / streamvbyte

Fast integer compression in C using the StreamVByte codec

Support for Signed Types Integrated in the CODECs

iiSeymour opened this issue · comments

I have an interest in using streamvbyte with signed types (specifically int16) and have created a Python wrapper which supports this, but the overhead is quite large, as you would imagine: iiSeymour/pystreamvbyte#1.

Native support for an efficient zigzag encoding/decoding for int16 and int32 would be great. I imagine this is something that would vectorise well?

Yes. I expect it would vectorize very well.

commented

Zigzag support for int32 is implemented in this branch: here

specifically:
zigzag ops
decoder
encoder

@aqrit Is there a clean way to bring this into the main streamvbyte repo? Want to issue a PR? Or do we do something dirtier?

commented

Is there a clean way to bring this into the main streamvbyte repo?

Not really, I touched everything. It currently lacks the ARM routines. My original plan was to implement an AVX2 table-less version and then package it up... but I never got around to it.

@iiSeymour

What about this then... As a stopgap measure, we add zigzag_decode and zigzag_encode functions to the library with definitions such as these...

https://gist.github.com/lemire/b6437fbd193395d8e4ccac1a5b2e50cc

This is such simple code that the autovectorizer should do a fine job (use -O3 with GCC, -O2 with clang).
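
Roughly, the helpers are of this shape (a sketch; the gist above has the exact definitions):

#include <stdint.h>
#include <stddef.h>

/* sketch only: per-element zigzag over a buffer, simple enough for the autovectorizer */
void zigzag_encode(const int32_t *in, uint32_t *out, size_t count) {
    for (size_t i = 0; i < count; i++)
        out[i] = ((uint32_t)in[i] << 1) ^ (uint32_t)(in[i] >> 31);
}

void zigzag_decode(const uint32_t *in, int32_t *out, size_t count) {
    for (size_t i = 0; i < count; i++)
        out[i] = (int32_t)((in[i] >> 1) ^ (0u - (in[i] & 1)));
}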

Then you can call these functions as needed, to convert your signed integers into unsigned integers and back.

It is not going to be as fast as having zigzag encoding integrated into the main functions, but it should be a lot faster than doing the conversion in Python or whatever else.

We would leave the issue open, to be resolved with better code... but you'd have an 'ok' fix meanwhile.

Thoughts?

Sounds like a plan 👍

Implemented in the latest release. Make sure to compile with aggressive optimization flags and it should be quite fast.

Testing some implementations we have internally I've found the following code to perform quite well:

https://gist.github.com/jorj1988/a2f2f14719d46074900dd563dc31a1c7

It does a delta >> zigzag >> expand to 32 bit >> streamvbyte encode
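
For context, the pipeline has roughly this shape (a sketch, not the gist's actual code; it uses the library's streamvbyte_encode plus a caller-provided 32-bit scratch buffer):

#include <stdint.h>
#include <stddef.h>
#include "streamvbyte.h"

/* delta -> zigzag -> widen to 32 bits -> streamvbyte encode (sketch);
   out must be large enough for the worst-case compressed size */
size_t encode_int16_delta_zigzag(const int16_t *in, size_t count,
                                 uint32_t *scratch, uint8_t *out) {
    uint16_t prev = 0;
    for (size_t i = 0; i < count; i++) {
        uint16_t cur   = (uint16_t)in[i];
        uint16_t delta = (uint16_t)(cur - prev);   /* delta, unsigned wrap-around */
        prev = cur;
        /* 16-bit zigzag: small |delta| -> small unsigned value, widened to 32 bits */
        scratch[i] = (uint16_t)((delta << 1) ^ (0u - (delta >> 15)));
    }
    return streamvbyte_encode(scratch, (uint32_t)count, out);
}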

Thanks. That is a useful reference.

With delta zigzag as per #29 I see about a 40% improvement over my Python implementation with int32 arrays and 20% with int16 (the extra cast is expensive).

@jorj1988 thanks, I'll have a go at benchmarking your implementation and see what kind of performance improvements it yields.

@aqrit I would certainly benefit from an AVX2 implementation if you're still up for it.

A look at the performance across types iiSeymour/pystreamvbyte#1 (comment)

I had a quick go at modifying the tests/perf utility to reproduce in C.

streamvbyte_encoding
streamvbyte_decoding

I understood the 'int32' perf as in "you need to apply zigzag and then compress". However, I am puzzled regarding 'int16'. What do you do exactly?

My primary use case is working with int16 data and that means an expensive copy to cast up in Python before every encode/decode.

commented

For encoding 16-bit words a shuffle table wouldn't even be needed.
A simple 8-byte prefix sum could generate the shuffle mask.
This is because (at best) only every other byte gets dropped (using the uint32_t "1234" format).

I've experimented with a prefix sum for the shuffle and didn't see much of a performance increase for my test set. I'd be interested in seeing a more competent implementation of this.

commented

I was unable to find any shortcuts for encoding using prefix-sums. My apologies.

Zigzag is overkill for int16,
assuming these are the only two states:
A.) Two byte literal.
B.) One byte sign extend.

Hmm, I sort of see what you're saying, but I'm having trouble putting together code to implement it - do you have a pseudocode example of what you're suggesting?

commented

It is easier to check bit_7 than it is to check bit_0.

m = pcmpgtb 0, v
m = psllw m, 8
v = pxor v, m
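
In intrinsics, that is roughly (a sketch):

#include <emmintrin.h>  /* SSE2 */

/* flip the high byte of each 16-bit word whose low byte has bit_7 set;
   since the low byte is untouched, applying the same op again undoes it */
static inline __m128i pseudo_zigzag_epi16(__m128i v) {
    __m128i m = _mm_cmpgt_epi8(_mm_setzero_si128(), v);  /* pcmpgtb 0, v */
    m = _mm_slli_epi16(m, 8);                            /* psllw m, 8  */
    return _mm_xor_si128(v, m);                          /* pxor v, m   */
}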

I don't think anything is worth doing while they are being piped through the int32 functions.
It'd be pretty easy to add real support for int16... but it is too boring.

I've been thinking about doing some avx2 code for shorts...
I'd repurpose my "despacer" for the encoder.
my current thoughts on a decoder

thanks - I'll take a look

commented

attempt at uint16 support using SSSE3
https://github.com/aqrit/streamvbyte/blob/master/contrib/short_varint.c
(strategies chosen on a whim)

@aqrit here are some performance results against using uint16 with pystreamvbyte (looking good!).

$ seq 3 8 | xargs -I% ./tests/perf --dtype uint16 --size 1e%
                   streamvbyte_encode                   
encoding time 0.000878 s,     113,941,280 uint16s/sec ( 0.228 GB/s)
decoding time 0.000846 s,     118,175,512 uint16s/sec ( 0.236 GB/s)
Compressed 2,000 bytes to 1,250 bytes (62.50%)
                       encode_16                        
encoding time 0.000617 s,     162,136,568 uint16s/sec ( 0.324 GB/s)
decoding time 0.000710 s,     140,923,275 uint16s/sec ( 0.282 GB/s)
Compressed 2,000 bytes to 1,125 bytes (56.25%)
                   streamvbyte_encode                   
encoding time 0.001599 s,     625,567,691 uint16s/sec ( 1.251 GB/s)
decoding time 0.001213 s,     824,655,188 uint16s/sec ( 1.649 GB/s)
Compressed 20,000 bytes to 12,500 bytes (62.50%)
                       encode_16                        
encoding time 0.000885 s,   1,130,241,092 uint16s/sec ( 2.260 GB/s)
decoding time 0.000759 s,   1,316,818,644 uint16s/sec ( 2.634 GB/s)
Compressed 20,000 bytes to 11,250 bytes (56.25%)
                   streamvbyte_encode                   
encoding time 0.015245 s,     655,947,521 uint16s/sec ( 1.312 GB/s)
decoding time 0.007395 s,   1,352,308,383 uint16s/sec ( 2.705 GB/s)
Compressed 200,000 bytes to 125,000 bytes (62.50%)
                       encode_16                        
encoding time 0.004682 s,   2,135,859,446 uint16s/sec ( 4.272 GB/s)
decoding time 0.002929 s,   3,414,694,145 uint16s/sec ( 6.829 GB/s)
Compressed 200,000 bytes to 112,500 bytes (56.25%)
                   streamvbyte_encode                   
encoding time 0.071367 s,   1,401,202,893 uint16s/sec ( 2.802 GB/s)
decoding time 0.049594 s,   2,016,379,047 uint16s/sec ( 4.033 GB/s)
Compressed 2,000,000 bytes to 1,250,000 bytes (62.50%)
                       encode_16                        
encoding time 0.029260 s,   3,417,658,937 uint16s/sec ( 6.835 GB/s)
decoding time 0.017225 s,   5,805,457,615 uint16s/sec (11.611 GB/s)
Compressed 2,000,000 bytes to 1,125,000 bytes (56.25%)
                   streamvbyte_encode                   
encoding time 3.109820 s,     321,562,067 uint16s/sec ( 0.643 GB/s)
decoding time 2.542501 s,     393,313,440 uint16s/sec ( 0.787 GB/s)
Compressed 20,000,000 bytes to 12,500,000 bytes (62.50%)
                       encode_16                        
encoding time 0.680088 s,   1,470,398,713 uint16s/sec ( 2.941 GB/s)
decoding time 0.281700 s,   3,549,875,880 uint16s/sec ( 7.100 GB/s)
Compressed 20,000,000 bytes to 11,250,000 bytes (56.25%)
                   streamvbyte_encode                   
encoding time 31.072751 s,     321,825,382 uint16s/sec ( 0.644 GB/s)
decoding time 32.966249 s,     303,340,543 uint16s/sec ( 0.607 GB/s)
Compressed 200,000,000 bytes to 125,000,000 bytes (62.50%)
                       encode_16                        
encoding time 8.543753 s,   1,170,445,798 uint16s/sec ( 2.341 GB/s)
decoding time 11.135855 s,     898,000,222 uint16s/sec ( 1.796 GB/s)
Compressed 200,000,000 bytes to 112,500,000 bytes (56.25%)

Array size vs throughput

short_encode

Any further thoughts on int16?

Someone want to prepare a PR based on @aqrit 's work?

commented

Any further thoughts on int16?

Not sure what you’re looking for...

Here a simple RLE of the HIBYTEs was used.
(the extra subtract is used to turn it into RLE0)

uint16 could be the same speed as uint32 if one wanted to use a 4 KiB dictionary.
But I like a 64-bit multiply better...
For AVX2, I still like shift/add of nibbles to do the prefix sum.

The decoder’s inner loop could be fully unrolled.
The decoder’s scalar tail loop should probably be using cmov.

There is a lot of "room for improvement" in the encoder.

Signed integer support is just a couple extra instructions.

ramblings:
If one wanted to try for a little better compression ratio...
One could move back towards a traditional “VByte”/sql-lite style.
psubusb would let us choose any particular partition point (0x80...0xFE)
So say, if we knew that all values in a block used fewer than 13 bits, we could choose a partition of 240, where values above 240 use 2 bytes and values below use 1 byte.
If all 16 bits were needed in a block then we’d put the partition at 128 and pick up all the sign bits as an extra 1 bit per word of overhead. This strat might deal better with noise than a bit-packer?

There is a lot of "room for improvement" in the encoder.

Encoding speed is more important than decoding for my use-case. Do you think AVX2 would result in a significant improvement?

Signed integer support is just a couple extra instructions.

Could you push a version? I'm more comfortable in Python, and C with intrinsics scares me 😄

If one wanted to try for a little better compression ratio...

I've found a pass of zstd (level 1) after svb is very effective but quite costly in comparison. For the data I'm primarily interested in, the range is actually only 11 bits, and with delta encoding ~6.
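
The svb + zstd pass is nothing fancy, roughly this shape (a sketch assuming zstd's one-shot ZSTD_compress):

#include <stdint.h>
#include <stddef.h>
#include <zstd.h>
#include "streamvbyte.h"

/* sketch: streamvbyte first, then zstd level 1 over the compressed bytes */
size_t svb_then_zstd(const uint32_t *in, uint32_t count, uint8_t *svb_buf,
                     void *out, size_t out_capacity) {
    size_t svb_len = streamvbyte_encode(in, count, svb_buf);
    return ZSTD_compress(out, out_capacity, svb_buf, svb_len, 1 /* level 1 */);
}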

commented

I'll post something this weekend. If optimizing for encoding speed, the pseudo zigzag operation only requires 1 instruction (to encode).

AVX2
The "despacer" step for AVX2 is pretty much the same speed. Maybe gather would do something... on a few of the new(ish) machines?

zstd
Using FSE directly may be better?

commented

If a 16-bit value in the range 0...255 is encoded as 1 byte,
and everything else is encoded as 2 bytes,
how does one instead encode the range -128...127 as 1 byte?
Answer: just add/subtract 128 to/from each value.
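
A scalar sketch of that idea (the casts lean on the usual two's-complement conversions):

#include <stdint.h>

/* bias by 128: -128...127 lands in the 1-byte range 0...255 (sketch) */
static inline uint16_t bias_encode_16(int16_t v)  { return (uint16_t)((uint16_t)v + 128); }
static inline int16_t  bias_decode_16(uint16_t v) { return (int16_t)(uint16_t)(v - 128); }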

I can't decide how to optimize the encoder...
So instead here is a non-optimized version w/delta coding and signed integer support
(that may or may not be correct)
https://gist.github.com/aqrit/ef84d284cebe861d9e57db4129bcafc3

@aqrit I think I am missing something. I can see the delta coding but I'm confused about the signed integer support, as the functions still expect uint16_t. Is the idea that the calling function will shift the values by whatever the range calls for?

Anyway, tests seem to pass and here are some performance numbers -

$ seq 3 8 | xargs -I% ./tests/perf --dtype uint16 --size 1e%
                   streamvbyte_encode                   
encoding time 0.000732 s,     136,590,931 uint16s/sec ( 0.273 GB/s)
decoding time 0.000711 s,     140,561,579 uint16s/sec ( 0.281 GB/s)
Compressed 2,000 bytes to 1,250 bytes (62.50%)
                       encode_16_delta                        
encoding time 0.000477 s,     209,727,181 uint16s/sec ( 0.419 GB/s)
decoding time 0.000554 s,     180,504,078 uint16s/sec ( 0.361 GB/s)
Compressed 2,000 bytes to 1,343 bytes (67.15%)
                   streamvbyte_encode                   
encoding time 0.001404 s,     712,167,519 uint16s/sec ( 1.424 GB/s)
decoding time 0.001162 s,     860,649,607 uint16s/sec ( 1.721 GB/s)
Compressed 20,000 bytes to 12,500 bytes (62.50%)
                       encode_16_delta                        
encoding time 0.000861 s,   1,161,541,277 uint16s/sec ( 2.323 GB/s)
decoding time 0.000940 s,   1,064,149,052 uint16s/sec ( 2.128 GB/s)
Compressed 20,000 bytes to 13,396 bytes (66.98%)
                   streamvbyte_encode                   
encoding time 0.014632 s,     683,416,284 uint16s/sec ( 1.367 GB/s)
decoding time 0.006711 s,   1,489,996,555 uint16s/sec ( 2.980 GB/s)
Compressed 200,000 bytes to 125,000 bytes (62.50%)
                       encode_16_delta                        
encoding time 0.004137 s,   2,417,190,092 uint16s/sec ( 4.834 GB/s)
decoding time 0.004507 s,   2,218,526,123 uint16s/sec ( 4.437 GB/s)
Compressed 200,000 bytes to 134,093 bytes (67.05%)
                   streamvbyte_encode                   
encoding time 0.073112 s,   1,367,762,691 uint16s/sec ( 2.736 GB/s)
decoding time 0.050156 s,   1,993,780,085 uint16s/sec ( 3.988 GB/s)
Compressed 2,000,000 bytes to 1,250,000 bytes (62.50%)
                       encode_16_delta                        
encoding time 0.032616 s,   3,065,940,498 uint16s/sec ( 6.132 GB/s)
decoding time 0.034485 s,   2,899,804,526 uint16s/sec ( 5.800 GB/s)
Compressed 2,000,000 bytes to 1,343,839 bytes (67.19%)
                   streamvbyte_encode                   
encoding time 3.970338 s,     251,867,737 uint16s/sec ( 0.504 GB/s)
decoding time 3.336252 s,     299,737,591 uint16s/sec ( 0.599 GB/s)
Compressed 20,000,000 bytes to 12,500,000 bytes (62.50%)
                       encode_16_delta                        
encoding time 1.036077 s,     965,179,624 uint16s/sec ( 1.930 GB/s)
decoding time 0.392249 s,   2,549,402,805 uint16s/sec ( 5.099 GB/s)
Compressed 20,000,000 bytes to 13,427,945 bytes (67.14%)
                   streamvbyte_encode                   
encoding time 34.064612 s,     293,559,780 uint16s/sec ( 0.587 GB/s)
decoding time 36.469576 s,     274,201,159 uint16s/sec ( 0.548 GB/s)
Compressed 200,000,000 bytes to 125,000,000 bytes (62.50%)
                       encode_16_delta                        
encoding time 10.684904 s,     935,899,845 uint16s/sec ( 1.872 GB/s)
decoding time 13.823414 s,     723,410,296 uint16s/sec ( 1.447 GB/s)
Compressed 200,000,000 bytes to 134,275,973 bytes (67.14%)

commented

It is the exact same operation for signed or unsigned numbers (AFAIK).

If the function signature was changed to int16_t....
Then I'd probably cast it back to uint16_t inside the function.
(because bit-twiddling is done on unsigned numbers in C)
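
Something like this (a sketch; encode_u16 stands in for whatever the unsigned routine is called):

#include <stdint.h>
#include <stddef.h>

size_t encode_u16(const uint16_t *in, size_t count, uint8_t *out);  /* hypothetical unsigned routine */

/* sketch: keep int16_t at the API boundary, do the bit-twiddling on the unsigned view */
size_t encode_s16(const int16_t *in, size_t count, uint8_t *out) {
    return encode_u16((const uint16_t *)in, count, out);  /* same bits, unsigned type */
}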

Do not try to opcode the sign. That is impossible. Instead, only try to realize the truth: there is no sign. Then you'll see it is not the sign-type that differentiates, it is only yourself.

Using FSE directly may be better?

I spent some time looking at fse and it seems streamvbyte + zstd is hard to beat (encoding).

             Time (s)   Compression (%)
zstd            1.002             63.98
fse             0.690             64.81
svb + fse       0.632             41.53
svb + zstd      0.431             37.00
svb             0.241             63.53

Using your latest short_enc yields impressive results.

               Time (s)   Compression (%)
svb16 + zstd      0.242             37.13
svb16             0.048             56.28

commented

back on the topic of signed 32-bit integers:

Zigzag vs Sign extending from a variable bit-width
(https://graphics.stanford.edu/~seander/bithacks.html#VariableSignExtend)

Zigzag and VariableSignExtend -- using SSE4.1 -- both require 4 instructions to decode.

Zigzag:
2 instructions to create sign mask from bit_0
1 shift right to drop bit_0
1 xor with the sign mask
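
e.g., roughly (an SSE2 sketch of that count):

#include <emmintrin.h>

static inline __m128i zigzag_decode_epi32(__m128i v) {
    __m128i m = _mm_sub_epi32(_mm_setzero_si128(),
                              _mm_and_si128(v, _mm_set1_epi32(1)));  /* sign mask from bit_0 */
    v = _mm_srli_epi32(v, 1);                                        /* drop bit_0 */
    return _mm_xor_si128(v, m);                                      /* xor with the sign mask */
}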

VariableSignExtend:
(the bit to extend -- aka. the mask bit -- would be marked in the shuffle control index)
two instructions to isolate the mask bit (and, shift)
two instructions to extend: (x ^ m) - m

However,
if optimizing for pure decode performance,
VariableSignExtend decode could be done using only two instructions in total (I think...)

So decoding could just be:
psrad shuf, 8
paddd x, shuf
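
or, in intrinsics (a sketch):

#include <emmintrin.h>

/* 'shuf' is the decode shuffle index already sitting in a register (sketch) */
static inline __m128i varsignextend_decode_epi32(__m128i x, __m128i shuf) {
    shuf = _mm_srai_epi32(shuf, 8);   /* psrad shuf, 8 */
    return _mm_add_epi32(x, shuf);    /* paddd x, shuf */
}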

The sign bit could be flipped (XORed) during encoding instead of during decoding.
(but do not flip bit_31... if combining with the next trick)

The decode shuffle index is known during encoding.
So... some value could be subtracted from the original value,
such that when the shifted decode shuffle index is added to the value during decoding,
the original value will be recovered.

For scalar decoding, w/fast unaligned memory access,
VariableSignExtend is FREE because we probably want to be using a variable shift anyways.
(https://gist.github.com/aqrit/f666e05362bc75376ac59a34879fad4f)

VariableSignExtend is 1 instruction cheaper to encode than Zigzag, in the general case.

Raising a question related to supporting signed integers in this library: how should the library handle overflows/underflows in signed integer arithmetic, which are undefined behavior?

Here are some examples where we don't handle potential signed overflows at the moment

For example, the zigzag encode function does val + val, which means that if my signed integer is > INT32_MAX / 2, I will run into undefined behavior. Or when I delta-encode and in[i] - prev overflows. The same is true for INT32_MIN.

If we support signed integers in the future, we should maybe think about raising an error with the user instead of running into undefined behavior. With unsigned integers at the moment we get a wrap-around and the behavior we want "for free".

@daniel-j-h There is no need to report an error. All processors use two’s complement so addition instructions are mostly sign unaware (except for the flags).

A SIMD zigzag would be written with intrinsics and would thus sidestep UB.

Otherwise we can provide qualifiers to discourage sanitizers from providing false positives.

The C and C++ languages we code against specify signed integer overflow as undefined behavior; it's therefore unclear to me whether we can just 'sidestep' this UB or whether compilers can do whatever they want in the case of UB.

I just wanted to bring this up because, when I was looking at the pystreamvbyte wrapper, it was not clear to me that we encode ints in zigzag fashion and then have these potential undefined behavior issues in the current zigzag implementation.

@daniel-j-h

it's unclear to me if we can just "sidestep" this UB

SIMD intrinsics are not covered by the standard at all so there is no UB. So yes, you can sidestep entirely.

Regarding the larger issues... Before C++20, I do not think you could sensibly implement zigzag encoding in a well-defined C/C++ way as per the standards. The zigzag encoding trick assumes that you have two's complement, something that you were not allowed to assume prior to C++20.

Let me elaborate...

This...

uint32_t _zigzag_encode_32 (int32_t val) {
	return (val + val) ^ (val >> 31);
}

should become... in ARM NEON...

veorq_s32( vaddq_s32(val, val), vshrq_n_s32(val,31)) 

The latter will never trigger UB.

Notice how easy it is.

Or @aqrit 's x64 intrinsics...

__m128i _mm_zigzag_encode_epi32 (__m128i v) {
	__m128i signmask =_mm_cmpgt_epi32(_mm_setzero_si128(), v);
	return _mm_xor_si128(_mm_add_epi32(v, v), signmask);
}

Notice that we expect the compiler to autovectorize zigzag to equivalent code. However, we could do better with a tight integration as we would avoid the two passes over the data.

@daniel-j-h More critically, the UB is skin deep. You can cast to unsigned integers, do the addition, and cast back. Prior to C++20, the result was implementation defined, but with C++20, you will get consistent standard-defined results.
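
For example, a sketch of the cast-to-unsigned version:

#include <stdint.h>

uint32_t zigzag_encode_32_nooverflow(int32_t val) {
    uint32_t v = (uint32_t)val;                  /* unsigned wrap-around is well defined */
    uint32_t sign = 0u - (uint32_t)(val < 0);    /* all ones if negative, else zero */
    return (v + v) ^ sign;
}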

That is not to say that signed integer overflow is not an issue. It can be. But it is nothing like 'well, we can't do this at all'.

We have to keep in mind that, underneath, the processor has no issue whatsoever with these overflows.

We now run the undefined behavior sanitizer in CI. See #36