mxmlnkn / indexed_bzip2

Fast parallel random access to bzip2 and gzip files in Python

Add lz4 support

mxmlnkn opened this issue · comments

It seems that seeking inside lz4 should be easily possible in the same way as for gzip. The only difference is that the lz4 window size is 64 KiB instead of 32 KiB. Supporting the space-saving ideas mentioned in mxmlnkn/rapidgzip#17 would help keep the index size down. In contrast to gzip, it should be easy to support creating windows inside an lz4 block, even though lz4 block sizes are supposed to be limited to 4 MiB for wider support. In general, they can be arbitrarily large (64-bit size).
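
As a rough illustration, a minimal sketch of what such an index could hold (the class and function names are hypothetical, not the actual indexed_bzip2/rapidgzip index layout): each seek point stores the compressed offset at which decompression can resume, the corresponding uncompressed offset, and the preceding 64 KiB of decompressed data.

```python
from dataclasses import dataclass
from typing import List

WINDOW_SIZE = 64 * 1024  # lz4 matches reach back at most 64 KiB


@dataclass
class SeekPoint:
    compressed_offset: int    # byte offset of a resumable position in the .lz4 file
    uncompressed_offset: int  # corresponding offset in the decompressed stream
    window: bytes             # last 64 KiB of decompressed data before this point


def find_seek_point(index: List[SeekPoint], target_offset: int) -> SeekPoint:
    """Return the last seek point at or before the requested uncompressed offset."""
    candidates = [point for point in index if point.uncompressed_offset <= target_offset]
    if not candidates:
        raise ValueError("Requested offset lies before the first seek point")
    return max(candidates, key=lambda point: point.uncompressed_offset)
```

Decompression would then resume at compressed_offset, using window as the initial dictionary, and discard output until target_offset is reached.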

Seeking inside large lz4 files would be nice for ratarmount and other applications. Specialized subset formats already make this possible, but they impose requirements on the compressor used. The same applies to the lz4 frame content-size flag, which is only optional.
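
For reference, a minimal sketch of checking whether the first lz4 frame of a file advertises its content size. It assumes the standard lz4 frame format (magic number 0x184D2204 followed by the FLG byte, whose bit 3 is the content-size flag); the function name is made up for illustration.

```python
import struct

LZ4_FRAME_MAGIC = 0x184D2204


def has_content_size(file_path: str) -> bool:
    """Return True if the first lz4 frame sets the optional content-size flag."""
    with open(file_path, 'rb') as file:
        header = file.read(5)  # 4 B magic number + 1 B FLG byte
    if len(header) < 5 or struct.unpack('<I', header[:4])[0] != LZ4_FRAME_MAGIC:
        raise ValueError("Not an lz4 frame")
    flg = header[4]
    return bool(flg & 0b0000_1000)  # bit 3: content-size field present
```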

Parallel decompression is another matter entirely.

  • It might be possible to detect valid start positions because there seem to be some things to check for, e.g., the offset may not be 0, and the high 4 bits probably are zero when the low 4 bits are < 15.
  • Looking for lz4 blocks themselves might not be advisable because they can be arbitrarily large and 4 MiB is only a suggested maximum size. It needs to be checked what real-world values various compressors produce. lz4 frames do have byte-aligned 4 B magic bytes, which should make it possible to search for them quite quickly with memchr (see the scan sketch after this list). Maybe this would make it feasible again.
  • "this format has a maximum achievable compression ratio of about ~250.".
  • There is no Huffman coding. Because of this, single-core decompression probably reaches memcpy bandwidth anyway, so parallelizing likely only makes sense on systems with 4+ memory channels. Even then, we would still have the interdependencies between lz4 blocks. Adding a second pass with markers might slow down decompression by more than 2x and thereby make parallelization infeasible again. It might be possible to compute only the end-of-block window by propagating just that and discarding the rest of the output; then the windows could be resolved serially and the blocks decompressed in parallel. Again, it is doubtful that this would make anything faster.
  • Note that parallel decompression is not easy anyway: the marker approach works nicely with 16-bit marker symbols for deflate, but here we would need 17 bits to store a 16-bit index into the 64 KiB window plus a flag to distinguish literals from markers (see the marker sketch after this list). So we would first need to extend the algorithm to 32-bit symbols, which increases the overhead even more. The same is true for Zstandard. I think both are better supported with serial decompression to create the index, followed by random access plus parallel decompression once the index exists.
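
For the magic-byte search mentioned in the second bullet, a minimal sketch (CPython's bytes.find already uses a fast memchr-like search internally; any hit would still need to be verified by parsing the frame header that follows, since these 4 bytes can also occur inside compressed data):

```python
from typing import List

LZ4_FRAME_MAGIC_BYTES = b'\x04\x22\x4D\x18'  # magic number 0x184D2204 in little-endian byte order


def find_frame_candidates(data: bytes) -> List[int]:
    """Return the byte offsets of all occurrences of the lz4 frame magic number."""
    candidates = []
    position = data.find(LZ4_FRAME_MAGIC_BYTES)
    while position != -1:
        candidates.append(position)
        position = data.find(LZ4_FRAME_MAGIC_BYTES, position + 1)
    return candidates
```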

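To make the 17-bit problem from the last bullet concrete, here is a hypothetical marker encoding (assumed names, not rapidgzip's actual implementation): literals occupy values 0-255 and references into the still-unknown 64 KiB window start at 256, so the largest symbol is 256 + 65535, which no longer fits into 16 bits, and the intermediate buffer would have to hold 32-bit symbols.

```python
# Hypothetical two-pass marker encoding, for illustration only:
# symbols 0-255 are already-resolved literal bytes, symbols >= 256 reference a
# position in the not-yet-known 64 KiB window preceding the block.
MARKER_BASE = 256
WINDOW_SIZE = 64 * 1024


def encode_literal(byte_value: int) -> int:
    return byte_value                        # fits into 8 bits


def encode_window_reference(window_offset: int) -> int:
    assert 0 <= window_offset < WINDOW_SIZE  # the offset alone needs 16 bits ...
    return MARKER_BASE + window_offset       # ... so the symbol needs 17 bits -> 32-bit storage


def resolve(symbols, window: bytes) -> bytes:
    """Second pass: substitute window references once the real window is known."""
    return bytes(s if s < MARKER_BASE else window[s - MARKER_BASE] for s in symbols)
```
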
Also add lzo support.