Compression uint32_t stream with lots of zeroes

Question

vkazanov opened this issue 6 years ago

I am trying to use streamvbyte in in-house archiving software, which we'll hopefully be able to publish as open source at some point. Your library is a great match for my use case, with its brilliant performance and compression that's good enough.

There's a catch though.

My stream of uint32_t's looks like the following: 234, 566, 0, 0, 333, 0, 0, 0, 1578987, 0, 234, 444, <a few million uint32_t's more>. Notice that there are lots of zeroes, about 30-40% of the stream. Zero distribution is highly unpredictable, and I know that zero run length is probably gonna be about 2-3 zeroes max.

But streamvbyte can only use 1, 2, 3, or 4 bytes per integer, depending on the value. I calculated that for my use case it's much more reasonable to have something like 0, 1, 2, or 4 bytes, i.e.:

I don't want to include zeroes in the stream.
I don't really need 3 bytes values.

This would mean that I would only have to keep about 2 bits per zero value.
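
A minimal scalar sketch of the 0/1/2/4 layout described above may help make the idea concrete. It assumes the usual streamvbyte arrangement of a block of 2-bit control codes (four per byte) followed by the data bytes, stored little-endian. The names `code_0124`, `len_0124`, `sketch_encode_0124`, and `sketch_decode_0124` are hypothetical and are not the library's API; this is an illustration, not the vectorized code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 2-bit control code for a value under the 0/1/2/4 scheme. */
static uint8_t code_0124(uint32_t v) {
    if (v == 0) return 0;          /* zero: no data bytes at all */
    if (v < (1u << 8)) return 1;   /* fits in 1 data byte */
    if (v < (1u << 16)) return 2;  /* fits in 2 data bytes */
    return 3;                      /* everything else takes 4 data bytes */
}

/* Data length in bytes for each 2-bit control code. */
static const uint8_t len_0124[4] = {0, 1, 2, 4};

/* Encode n values. out must hold at least (n + 3) / 4 + 4 * n bytes.
 * Returns the total number of bytes written (control + data). */
size_t sketch_encode_0124(const uint32_t *in, size_t n, uint8_t *out) {
    uint8_t *keys = out;                /* control bytes come first */
    uint8_t *data = out + (n + 3) / 4;  /* data bytes follow */
    memset(keys, 0, (n + 3) / 4);
    for (size_t i = 0; i < n; i++) {
        uint32_t v = in[i];
        uint8_t code = code_0124(v);
        keys[i / 4] |= (uint8_t)(code << (2 * (i % 4)));
        for (uint8_t b = 0; b < len_0124[code]; b++) {
            *data++ = (uint8_t)(v >> (8 * b));  /* little-endian bytes */
        }
    }
    return (size_t)(data - out);
}

/* Decode n values. Returns the number of bytes consumed from in. */
size_t sketch_decode_0124(const uint8_t *in, size_t n, uint32_t *out) {
    const uint8_t *keys = in;
    const uint8_t *data = in + (n + 3) / 4;
    for (size_t i = 0; i < n; i++) {
        uint8_t code = (keys[i / 4] >> (2 * (i % 4))) & 3;
        uint32_t v = 0;
        for (uint8_t b = 0; b < len_0124[code]; b++) {
            v |= (uint32_t)(*data++) << (8 * b);
        }
        out[i] = v;  /* code 0 reads nothing and yields 0 */
    }
    return (size_t)(data - in);
}
```

With this sketch, the twelve sample values above (234, 566, 0, 0, 333, 0, 0, 0, 1578987, 0, 234, 444) take 3 control bytes plus 12 data bytes, so 15 bytes in total, and each zero costs only its 2-bit control code.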

I am going through your related papers - which are very readable! - and the code and it seems to me that it should be possible to just patch streamvbyte to match my needs.

So here's a question:

Is there anything that I don't understand and that might become a problem here?
I'll have 3-5 full time days to solve the issue. Is there a way I can solve the issue and contribute back some code?

Thank you!

Daniel Lemire · Answer 1 · Wed Jul 11 2018 20:08:03 GMT+0800 (China Standard Time)

That's definitely a fine possibility. There is nothing that I can see that should be a problem. In fact, what you describe is something we discussed.

If you're willing to do some of the coding, we can certainly assist you... I don't expect it to be terribly challenging. A few days should be enough. Getting a PR to solve your use case would be great.

Vladimir Kazanov · Answer 2 · Wed Jul 11 2018 22:17:24 GMT+0800 (China Standard Time)

Great! I was just wondering if there's something I misunderstand and if there are any fundamental blockers. I'll come back next week with either a PR or more specific questions.

Daniel Lemire · Answer 3 · Wed Jul 11 2018 22:27:18 GMT+0800 (China Standard Time)

Feel free to email me personally if needed.

Kendall Willets · Answer 4 · Thu Jul 12 2018 02:14:45 GMT+0800 (China Standard Time)

Do you need vector encoding? The scalar version would be easy to modify, but there are a couple of tricks in the vector code that I could help you with if needed.

Vladimir Kazanov · Answer 5 · Thu Jul 12 2018 16:49:50 GMT+0800 (China Standard Time)

@lemire thanks! Should I use the email address from your homepage?

@KWillets Practically speaking I don't really care about encoding. Decoding speed is much more important for me.

On the other hand, I do have some extra time, so if we can find a clean way to integrate my changes into the library, then both encoding and decoding will have to be modified.

Note that I have not yet properly dived into the code; I'm just doing some preparatory talks, reading papers, checking if project maintainers are responsive, etc. :-)

Daniel Lemire · Answer 6 · Thu Jul 12 2018 19:48:35 GMT+0800 (China Standard Time)

Yes, the email address from my home page is fine.

aqrit · Answer 7 · Tue Jul 17 2018 09:47:35 GMT+0800 (China Standard Time)

proof-of-concept:
https://gist.github.com/aqrit/9272c47b3f1ce23c565a7210b6935102

/bored :squirrel:

Daniel Lemire · Answer 8 · Thu Aug 23 2018 07:18:47 GMT+0800 (China Standard Time)

I merged @vkazanov's code.