Compressing a uint32_t stream with lots of zeroes
vkazanov opened this issue
I am trying to use streamvbyte in an in-house archiving software which we'll hopefully be able to publish as open source at some point. Your library is a great match for my use case, with its brilliant performance and compression that's good enough.
There's a catch though.
My stream of uint32_t's looks like the following: 234, 566, 0, 0, 333, 0, 0, 0, 1578987, 0, 234, 444, <a few million uint32_t's more>. Notice that there are lots of zeroes, about 30-40% of the stream. Zero distribution is highly unpredictable, and I know that zero run length is probably gonna be about 2-3 zeroes max.
But streamvbyte can only use 1, 2, 3 or 4 bytes per integer, depending on the value. I calculated that for my use case it's much more reasonable to have something like 0, 1, 2 or 4 bytes, i.e.:
- I don't want to include zeroes in the stream.
- I don't really need 3 bytes values.
This would mean that I would only have to keep about 2 bits per zero value.
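To make the proposal concrete, here is a minimal scalar sketch of a 0/1/2/4-byte scheme in C. It is not the streamvbyte API or its implementation: the names encode_0124, decode_0124, code_0124 and len_0124 are hypothetical, and the layout (all 2-bit control codes first, then little-endian data bytes) is an assumption modelled on the general idea of keeping control bits separate from data.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for illustration; not part of streamvbyte. */

/* Data bytes used by each 2-bit code: 0, 1, 2 or 4 bytes. */
static const uint8_t len_0124[4] = {0, 1, 2, 4};

/* Pick the 2-bit code for a value. Zero needs no data bytes at all. */
static uint8_t code_0124(uint32_t v) {
    if (v == 0) return 0;
    if (v < (1u << 8)) return 1;
    if (v < (1u << 16)) return 2;
    return 3; /* anything wider than 2 bytes is stored in 4 bytes */
}

/* Encode n values into out and return the number of bytes written.
   Assumed layout: (n + 3) / 4 control bytes, then the data bytes.
   out must have room for (n + 3) / 4 + 4 * n bytes in the worst case. */
size_t encode_0124(const uint32_t *in, size_t n, uint8_t *out) {
    size_t nkeys = (n + 3) / 4;
    uint8_t *keys = out;
    uint8_t *data = out + nkeys;
    memset(keys, 0, nkeys);
    for (size_t i = 0; i < n; i++) {
        uint32_t v = in[i];
        uint8_t code = code_0124(v);
        /* Four 2-bit codes are packed into each control byte. */
        keys[i / 4] |= (uint8_t)(code << ((i % 4) * 2));
        for (uint8_t b = 0; b < len_0124[code]; b++) {
            *data++ = (uint8_t)(v >> (8 * b)); /* little-endian */
        }
    }
    return (size_t)(data - out);
}

/* Decode n values from in and return the number of bytes consumed. */
size_t decode_0124(const uint8_t *in, size_t n, uint32_t *out) {
    const uint8_t *keys = in;
    const uint8_t *data = in + (n + 3) / 4;
    for (size_t i = 0; i < n; i++) {
        uint8_t code = (keys[i / 4] >> ((i % 4) * 2)) & 3;
        uint32_t v = 0;
        for (uint8_t b = 0; b < len_0124[code]; b++) {
            v |= (uint32_t)(*data++) << (8 * b);
        }
        out[i] = v; /* code 0 reads no data bytes and yields 0 */
    }
    return (size_t)(data - in);
}
```

Under these assumptions, the 12 sample values above take 3 control bytes plus 12 data bytes, 15 bytes in total, and each zero costs only its 2 control bits. A 1/2/3/4-byte scheme would spend one extra data byte on each of the six zeroes while saving one byte on 1578987, for 20 bytes.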
I am going through your related papers - which are very readable! - and the code, and it seems to me that it should be possible to just patch streamvbyte to match my needs.
So here's a question:
- Is there anything that I don't understand and that might become a problem here?
- I'll have 3-5 full time days to solve the issue. Is there a way I can solve the issue and contribute back some code?
Thank you!
That's definitely a fine possibility. There is nothing that I can see that should be a problem. In fact, what you describe is something we discussed.
If you're willing to do some of the coding, we can certainly assist you... I don't expect it to be terribly challenging. A few days should be enough. Getting a PR to solve your use case would be great.
Great! I was just wondering if there's something I misunderstand and if there are any fundamental blocks. I'll come back next week with either a PR or more specific questions.
Feel free to email me personally if needed.
Do you need vector encoding? The scalar version would be easy to modify, but there are a couple of tricks in the vector code that I could help you with if needed.
@lemire thanks! Should I use email from your homepage?
@KWillets Practically speaking I don't really care about encoding. Decoding speed is much more important for me.
On the other hand, I do have some extra time, so if we can find a clean way to integrate my changes into the library then both encoding and decoding will have to be modified.
Notice that I did not yet properly dive into the code, just doing some preparatory talks, reading papers, checking if project maintainers are responsive, etc. :-)
Yes, the email address from my home page is fine.
proof-of-concept:
https://gist.github.com/aqrit/9272c47b3f1ce23c565a7210b6935102
/bored :squirrel:
I merged @vkazanov 's code.