kspalaiologos / bzip3

A better and stronger spiritual successor to BZip2.

Extreme use case - very short data & long data (issues with header size, slow "take off", etc.)

dumblob opened this issue · comments

Just a few months ago a weird task landed on my desk - to compress data which was either very short (1-4 bytes long) or quite long (hundreds of bytes, kilobytes, or even megabytes).

I couldn't find any good compression scheme for this, as for the small inputs there were issues with overly large headers (more than 1 byte long) and with poor compression (for such a tiny number of bytes different compression techniques apply - the best of which appeared to be the "heuristic" of not compressing anything at all and, only once the stream grows beyond 16 bytes or so, suddenly starting to compress the data - of course taking the entropy of the first 16 bytes into account).

And for large data, there weren't many algorithms which would not allocate much and would have an extremely quick "take off" (i.e. not building huge dictionaries nor computing anything "up front"). Basically I was seeking an algorithm which, from the very beginning, does "pay as you go" rather than "pay more up front to make it potentially cheaper later".

Do you plan to support this extreme use case with bz3?

Data compression is all about finding correlations - that's why neural networks and adaptive model weighting took off. It's hard to compress very short data (~16 bytes), because it may not have enough correlations to warrant efficient output.

BZip3 is not a dictionary compressor - up front, it allocates around 8n + 180 KB of memory, where n is the block size; 5n of this is used for a decently fast construction of a suffix array for the Burrows-Wheeler transform, while the remaining 3n goes to the two buffers that hold the compressed data.
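To put that figure in perspective, here is a quick back-of-the-envelope calculation of my own (not a quote from the bzip3 documentation) for a 16 MiB block size:

```c
/* Rough memory estimate for a bzip3 state, following the 8n + 180 KB figure
 * quoted above. Purely illustrative arithmetic, not an API guarantee. */
#include <stdio.h>

int main(void) {
    const long long block = 16LL << 20;              /* n = 16 MiB block size */
    const long long state = 8 * block + 180 * 1024;  /* ~5n suffix array + ~3n buffers + fixed overhead */
    printf("approx. state size: %lld MiB\n", state >> 20);  /* prints 128 */
    return 0;
}
```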

When the block size is greater than 3 MiB, BZip3 employs what I call the "Sorted Rank Transform" (a reuse of an existing idea I had noticed elsewhere; it was very hard to find it online again), while on block sizes smaller than 3 MiB it employs the same transformation as BZip2 - the Move-To-Front transform.
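For readers unfamiliar with it, below is a minimal, generic sketch of the Move-To-Front transform - the textbook version of the idea, not bzip3's actual implementation:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Classic Move-To-Front: each byte is replaced by its current position in a
 * recency list, and that byte is then moved to the front of the list. Runs of
 * identical or recently-seen bytes (typical BWT output) map to small values. */
static void mtf_encode(const uint8_t *in, uint8_t *out, size_t n) {
    uint8_t table[256];
    for (int i = 0; i < 256; i++) table[i] = (uint8_t)i;
    for (size_t i = 0; i < n; i++) {
        uint8_t c = in[i], j = 0;
        while (table[j] != c) j++;       /* find the current rank of c */
        out[i] = j;
        memmove(table + 1, table, j);    /* shift ranks 0..j-1 down by one */
        table[0] = c;                    /* move c to the front */
    }
}
```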

BZip3 does already perform better than BZip2 on small text files:

bzip3:  0.01s user 0.00s system 96% cpu 5M memory 0.006 total
bzip2:  0.00s user 0.00s system 92% cpu 5M memory 0.005 total


18109 corpus/bee_movie.txt.bz2
17836 corpus/bee_movie.bz3

But "small" to the magnitude of 1-4 bytes doesn't seem to be manageable with any contemporary compression nor it does not seem that desirable to me as a compressor author.

That said, for short input data, bzip3 could use a special preset which ignores all transformations except the final entropy coding step to omit potentially 20 bytes of header data. I wouldn't count on it performing well, though, since the minimum output length in total would be over 5 bytes.

Maybe you would have better luck with specialised compression - if the short messages are English text, then you could try smaz. Most approaches used by general-purpose compressors (e.g. Huffman coding) carry a penalty (in this case, the Huffman table) that makes them unsuitable for this purpose. Generally speaking, what you're looking for is a dictionary compressor with a built-in dictionary that "knows" something about your input data upfront, and this niche is already filled by the countless variants of Lempel-Ziv (like LZMA, LZO, LZ77, LZ4, etc.).
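As an illustration of the smaz route - this assumes smaz's two-function C API (smaz_compress / smaz_decompress) as declared in its header; double-check smaz.h for the exact signatures:

```c
/* Compressing one short English string with smaz (antirez/smaz).
 * Assumes the two-function API from smaz.h; verify the signatures there. */
#include <stdio.h>
#include "smaz.h"

int main(void) {
    char in[] = "the quick brown fox jumps over the lazy dog";
    char out[256], back[256];

    int clen = smaz_compress(in, (int)sizeof(in) - 1, out, (int)sizeof(out));
    int dlen = smaz_decompress(out, clen, back, (int)sizeof(back));

    printf("original %d bytes -> compressed %d bytes -> restored %d bytes\n",
           (int)sizeof(in) - 1, clen, dlen);
    return 0;
}
```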

Moving on to the bigger data volume scenario, BZip3 provides a benefit over other compressors - you could compress or decompress every block of a file in parallel (which is not yet implemented in the reference implementation, but is planned for the future). Additionally, all the "startup cost" is just the initialisation of the BZip3 state data (which is fast enough, as shown above) - and that technically needs to happen just once during the lifetime of your application. State initialisation boils down to allocating a bunch of memory; for each block, sections of this memory are then filled with fresh data.
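To illustrate the reuse pattern, here is a rough sketch. The function names mirror libbzip3's public header (bz3_new, bz3_bound, bz3_encode_block, bz3_free), but the exact signatures may differ between versions, so treat this as an outline of the pattern rather than canonical usage:

```c
/* Reuse pattern: allocate the bzip3 state once, feed it many blocks.
 * Function names follow libbzip3's header, but signatures vary between
 * versions - this is an illustration of the idea, not a canonical example. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <libbz3.h>

void compress_many(uint8_t **blocks, size_t *sizes, size_t count, int32_t block_size) {
    struct bz3_state *state = bz3_new(block_size);    /* allocated once */
    uint8_t *buf = malloc(bz3_bound(block_size));     /* shared scratch buffer */

    for (size_t i = 0; i < count; i++) {
        memcpy(buf, blocks[i], sizes[i]);
        int32_t out_len = bz3_encode_block(state, buf, (int32_t)sizes[i]);
        /* ... write out_len bytes from buf to wherever they need to go ... */
        (void)out_len;
    }

    free(buf);
    bz3_free(state);                                  /* freed once */
}
```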

BZip3 was not exactly designed to fulfil this purpose, but I believe that, after these clarifications, you will be able to see some value in it for the latter case.

Moving on to the bigger data volume scenario, BZip3 provides a benefit over other compressors - you could compress or decompress every block of a file in parallel (which is not yet implemented in the reference implementation, but is planned for the future).

This sounds very promising. Do you think one could achieve the speeds of https://github.com/Blosc/c-blosc2 ?

Btw. do you think a purely JavaScript bz3 compressor is a viable option for small data (up to just a few MBytes)? Or is bz3 (planned to be) too reliant on CPU-specific optimization to perform acceptably?

Additionally, all the "startup cost" is just the initialisation of the BZip3 state data (which is fast enough, as shown above) - and that technically needs to happen just once during the lifetime of your application. State initialisation boils down to allocating a bunch of memory; for each block, sections of this memory are then filled with fresh data.

Perfect! If the API emphasizes this reuse in a way that e.g. Python bindings would be "nudged" to follow this reuse principle (one instance per thread-local storage, I suppose), then this would be a non-negligible advantage.

Generally speaking, what you're looking for is a dictionary compressor with a built-in dictionary that "knows" something about your input data upfront

Unfortunately not. The task was purely generic. Actually, the majority of the data was "unknown raw binary data".

Thinking about this, how about the following hack for the bz3 format specification:

  1. experimentally find the minimum size MS of random input data from which it is more or less guaranteed that compression will always make sense (i.e. the output will always be smaller than the input)
  2. declare in the bz3 spec that any input of size smaller than MS will never be compressed and thus it'll be equivalent either to memcpy() or even just returning the same pointer
  3. anything of size MS or above will be compressed, and if the resulting size is smaller than MS it will be padded up to MS, with the bz3 header always following (specifying how much of the data was padding, etc. - just everything bz3 needs); if no padding occurs, the bz3 header is simply inserted into the compressed data right at the MS offset

This way the bz3 format would guarantee never making small data worse, would retain full compatibility with any data incl. streaming data, and would be as efficient as it is now (the padding should be very uncommon, and even then it would be totally negligible, since we're talking about longer inputs at that point). A rough sketch of the rule is below.
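Something along these lines - MS is just a placeholder value and compress_bz3() is a hypothetical stand-in for the real compressor, not an existing API:

```c
/* Illustration only: the framing rule from points 1-3 above. MS is the
 * hypothetical experimentally determined threshold and compress_bz3() is a
 * hypothetical stand-in for the real compressor. */
#include <stddef.h>
#include <string.h>

#define MS 64  /* placeholder; the real value would be found experimentally */

size_t compress_bz3(const unsigned char *in, size_t in_len, unsigned char *out);  /* hypothetical */

size_t frame(const unsigned char *in, size_t in_len, unsigned char *out) {
    if (in_len < MS) {
        /* Point 2: inputs shorter than MS are stored verbatim, no header. */
        memcpy(out, in, in_len);
        return in_len;
    }
    /* Point 3: longer inputs are compressed; if the result drops below MS,
     * pad it back up to MS so that "shorter than MS" always means "raw". */
    size_t c_len = compress_bz3(in, in_len, out);
    if (c_len < MS) {
        memset(out + c_len, 0, MS - c_len);
        c_len = MS;
    }
    return c_len;
}
```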

Thoughts?

Additionally, all the "startup cost" is just the initialisation of the BZip3 state data (which is fast enough, as shown above) - and that technically needs to happen just once during the lifetime of your application. State initialisation boils down to allocating a bunch of memory; for each block, sections of this memory are then filled with fresh data.

Reading this again, I may have phrased my question badly. I meant avoiding the recreation of the compressor state for every single new input (imagine I have millions of those few-bytes-long, mutually unrelated and uncorrelated binary strings on input and don't want to recreate the compressor state for each one of them). Does bz3 support this use case?

This sounds very promising. Do you think one could achieve the speeds of https://github.com/Blosc/c-blosc2 ?

BZip3 compresses data at around 12 MiB/s (best case) or 8-9 MiB/s (worst case) using a single thread. I am not aware of any benchmarks for the compressor you linked.

Do you think a purely JavaScript bz3 compressor is a viable option for small data (up to just a few MBytes)? Or is bz3 (planned to be) too reliant on CPU-specific optimization to perform acceptably?

BZip3 currently contains no CPU specific code. Additionally, it should run without problems on big endian and little endian machines. The code might require some tweaks to run on old ANSI C compilers.

Perfect! If the API emphasizes this reuse in a way that e.g. Python bindings would be "nudged" to follow this reuse principle (one instance per thread-local storage, I suppose), then this would be a non-negligible advantage.

I don't plan on implementing Python bindings myself, but this is a good approach.

declare in the bz3 spec that any input of size smaller than MS will never be compressed and thus it'll be equivalent either to memcpy() or even just returning the same pointer

How would you distinguish a copied block from a compressed block? I plan on eventually implementing something akin to this, just not yet. I feel like BZip3 is unsuitable for very short messages and you'd be way better off simply compressing them in batches - consider this.
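As a rough illustration of the batching idea - pack_batch() below is my own example, not part of bzip3 - you could length-prefix the short records, concatenate them, and hand the whole buffer to the compressor as a single block:

```c
/* Batching sketch: concatenate many short records into one buffer with u32
 * length prefixes, then compress the packed buffer as one block.
 * pack_batch() is an illustration, not part of the bzip3 API. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

size_t pack_batch(const uint8_t **recs, const uint32_t *lens, size_t count,
                  uint8_t *out) {
    size_t off = 0;
    for (size_t i = 0; i < count; i++) {
        memcpy(out + off, &lens[i], sizeof(uint32_t));  /* length prefix */
        off += sizeof(uint32_t);
        memcpy(out + off, recs[i], lens[i]);            /* record payload */
        off += lens[i];
    }
    return off;  /* compress out[0..off) as one block afterwards */
}
```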

Reading this again, I may have phrased my question badly. I meant avoiding the recreation of the compressor state for every single new input (imagine I have millions of those few-bytes-long, mutually unrelated and uncorrelated binary strings on input and don't want to recreate the compressor state for each one of them). Does bz3 support this use case?

Yes.

How would you distinguish a copied block from a compressed block?

Any block shorter than MS is by definition uncompressed (thanks to the padding in those rare cases). And anything larger already has a header (or should I rather call it a "torsoer", since it's never at the "head" of the data ❓). So all cases are covered 😉.

I plan on eventually implementing something akin to this, just not yet.

Looking forward to that! I'd call that a breakthrough among all the existing compression formats I know of.

I feel like BZip3 is unsuitable for very short messages and you'd be way better off simply compressing them in batches - consider this.

That's the extreme part of the task - the impossibility of batching anything. My primary use case would be data for a key-value database (ideally for both keys and values). The data arrive non-continuously over time, and any request to store a value has to be answered immediately to keep latency very low. Thus I can't wait for the next piece of incoming data to arrive in order to batch things.

Roughly speaking, I get these results on small files:

Benchmark 1: ../bzip3 -e -b 1 cm.c cm.bz3
  Time (mean ± σ):       0.8 ms ±   0.2 ms    [User: 0.7 ms, System: 0.1 ms]
  Range (min … max):     0.7 ms …   1.5 ms    30 runs

Benchmark 2: bzip2 -k -f -9 cm.c
  Time (mean ± σ):       1.0 ms ±   0.1 ms    [User: 0.9 ms, System: 0.1 ms]
  Range (min … max):     0.9 ms …   1.7 ms    30 runs

Of course they're not very representative, since the program runtime is very small, but the initialisation is rather fast. BZip3 beats BZip2 by a small margin of 40 bytes:

1128 cm.bz3
4838 cm.c
1168 cm.c.bz2

Every byte counts (at least for my use case)!

I've done some tests and I concluded that I can't really help on messages shorter than 64 bytes. It's not really the goal of this project, so I'm closing this issue.

Ok, thanks anyway.

Btw. does the overall bz3 data format allow for cat archive.bz3 random_binary_data.bin > out.bz3; bz3_decompress out.bz3 > orig_data.bin? I.e. can I append arbitrary data to an existing bz3 archive without confusing the decompressor?