lemire / streamvbyte

Fast integer compression in C using the StreamVByte codec

Compressing a uint32_t stream with lots of zeroes

vkazanov opened this issue · comments

I am trying to use streamvbyte in in-house archiving software, which we'll hopefully be able to publish as open source at some point. Your library is a great match for my use case, with its brilliant performance and compression that's good enough.

There's a catch though.

My stream of uint32_t's looks like the following: 234, 566, 0, 0, 333, 0, 0, 0, 1578987, 0, 234, 444, <a few million uint32_t's more>. Notice that there are lots of zeroes, about 30-40% of the stream. Zero distribution is highly unpredictable, and I know that zero run length is probably gonna be about 2-3 zeroes max.

But streamvbyte can only use 1, 2, 3, or 4 bytes per integer, depending on the value. I calculated that for my use case it's much more reasonable to have something like 0, 1, 2, or 4 bytes, i.e.:

  1. I don't want to include zeroes in the stream.
  2. I don't really need 3-byte values.

This would mean that I would only have to keep about 2 bits per zero value.
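
To make the idea concrete, here is a minimal scalar sketch of the 0/1/2/4 scheme described above. It is not streamvbyte's API: the `svb0124_*` names, the key layout (2 bits per value, four values per key byte, lowest bits first), and the little-endian payload order are all assumptions made for illustration, and there is no vectorized path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; not part of the streamvbyte library.
   Payload bytes per 2-bit code: 0 -> zero value (no bytes),
   1 -> 1 byte, 2 -> 2 bytes, 3 -> 4 bytes. */
static const uint8_t svb0124_len[4] = {0, 1, 2, 4};

/* Pick the smallest code whose payload can hold v. */
static uint8_t svb0124_code(uint32_t v) {
    if (v == 0) return 0;
    if (v < (1u << 8)) return 1;
    if (v < (1u << 16)) return 2;
    return 3; /* anything wider than 2 bytes is stored in 4 */
}

/* Encode n values. keys must hold (n + 3) / 4 bytes and data up to
   4 * n bytes. Returns the number of data bytes written. */
size_t svb0124_encode(const uint32_t *in, size_t n,
                      uint8_t *keys, uint8_t *data) {
    assert(in != NULL || n == 0);
    uint8_t *d = data;
    for (size_t i = 0; i < n; i++) {
        if ((i & 3) == 0) keys[i / 4] = 0; /* start a fresh key byte */
        uint32_t v = in[i];
        uint8_t code = svb0124_code(v);
        keys[i / 4] |= (uint8_t)(code << (2 * (i & 3)));
        for (uint8_t b = 0; b < svb0124_len[code]; b++) {
            *d++ = (uint8_t)(v >> (8 * b)); /* little-endian payload */
        }
    }
    return (size_t)(d - data);
}

/* Decode n values. Returns the number of data bytes consumed. */
size_t svb0124_decode(const uint8_t *keys, const uint8_t *data,
                      size_t n, uint32_t *out) {
    assert(out != NULL || n == 0);
    const uint8_t *d = data;
    for (size_t i = 0; i < n; i++) {
        uint8_t code = (uint8_t)((keys[i / 4] >> (2 * (i & 3))) & 3);
        uint32_t v = 0; /* code 0 reads nothing and yields zero */
        for (uint8_t b = 0; b < svb0124_len[code]; b++) {
            v |= (uint32_t)(*d++) << (8 * b);
        }
        out[i] = v;
    }
    return (size_t)(d - data);
}
```

For the twelve sample values listed above (234, 566, 0, 0, 333, 0, 0, 0, 1578987, 0, 234, 444), this sketch writes 12 data bytes plus 3 key bytes, whereas a 1/2/3/4 layout needs 17 data bytes plus the same 3 key bytes. The trade-off is that values needing 3 bytes, such as 1578987, take 4 bytes here.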

I am going through your related papers - which are very readable! - and the code and it seems to me that it should be possible to just patch streamvbyte to match my needs.

So here's a question:

  1. Is there anything that I don't understand and that might become a problem here?
  2. I'll have 3-5 full time days to solve the issue. Is there a way I can solve the issue and contribute back some code?

Thank you!

That's definitely a fine possibility. There is nothing that I can see that should be a problem. In fact, what you describe is something we discussed.

If you're willing to do some of the coding, we can certainly assist you... I don't expect it to be terribly challenging. A few days should be enough. Getting a PR to solve your use case would be great.

Great! I was just wondering if there's something I misunderstand and if there are any fundamental blockers. I'll come back next week with either a PR or more specific questions.

Feel free to email me personally if needed.

Do you need vector encoding? The scalar version would be easy to modify, but there are a couple of tricks in the vector code that I could help you with if needed.

@lemire thanks! Should I use the email address from your homepage?

@KWillets Practically speaking I don't really care about encoding. Decoding speed is much more important for me.

On the other hand, I do have some extra time, so if we can find a clean way to integrate my changes into the library, then both encoding and decoding will have to be modified.

Note that I have not yet properly dived into the code; I'm just doing some preparatory talks, reading papers, checking if project maintainers are responsive, etc. :-)

Yes, the email address from my home page is fine.

I merged @vkazanov's code.