kspalaiologos / bzip3

A better and stronger spiritual successor to BZip2.

Idea: Explore using BWT Tunneling

pothos opened this issue

In https://github.com/waYne1337/tbwt the work of https://arxiv.org/abs/1804.01937 is implemented and it claims to offer a significant compression improvement.
Maybe this could be integrated here, (de)selectable through a command-line flag? If there is a marker in the bzip3 format, it could be used to tell old decompressors that they can't handle it.
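To make the marker idea concrete, here is a minimal sketch of how a feature flag in a stream header lets an older decoder refuse data it cannot handle. Everything in it is an assumption for illustration: the `DEMO1` signature, the one-byte flag field, and the `FLAG_TUNNELED` bit are hypothetical and do not describe the actual bzip3 format.

```python
import struct

# Hypothetical header layout: 5-byte signature followed by one flag byte.
# None of these values come from the real bzip3 format.
MAGIC = b"DEMO1"
FLAG_TUNNELED = 0x01  # hypothetical bit: payload uses BWT tunneling


def write_header(flags: int) -> bytes:
    """Build a header carrying the given feature flags."""
    return MAGIC + struct.pack("<B", flags)


def check_header(header: bytes, known_flags: int) -> int:
    """Return the flags, or raise if the stream needs unknown features."""
    if header[:len(MAGIC)] != MAGIC:
        raise ValueError("not a demo stream")
    (flags,) = struct.unpack_from("<B", header, len(MAGIC))
    unknown = flags & ~known_flags
    if unknown:
        # An old decoder stops here instead of producing garbage output.
        raise ValueError(f"unsupported feature flags: {unknown:#04x}")
    return flags
```

A decoder built before the feature existed would call `check_header(header, known_flags=0)` and fail cleanly on a tunneled stream, while a newer one would pass `known_flags=FLAG_TUNNELED`.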

Interesting piece of software. Having conducted some empirical tests, I don't think that it is a good idea for a couple of reasons.

I ran the benchmarks on the Silesia corpus: I ripped out the relevant parts from the repo and used them in a Lua driver script for a small in-house compression testing tool. The tunneling stage appears to take more than 60% of the compressor's runtime on that corpus, while yielding a compression gain on the order of a megabyte (~1% CR difference). While this could be a game changer for "slow" BWT compressors (and is undoubtedly extremely interesting from a CS theory standpoint!), it would be difficult to integrate nicely into a real-world compressor like bzip3. I personally see data compression as a knob which you can turn to get better CR at the cost of speed and vice versa (admitting that very clever people like Jarek Duda can game this system and have both). So:

  • Following this chain of thought, I think that TBWT is overall not the best way to twist the knob: considering the performance hit and such a small CR improvement, it would probably be better to use proper context mixing with multiple models (like PAQ) rather than a small model like bzip3's or BBB's (which hinges mostly on secondary symbol estimation).

  • Having more code paths introduces more problems with security and provenance, which a lot of hobby compressors or research projects don't tend to care about. bzip3 technically started out as a side project I wrote in high school, and the number of things I had to fix due to my own negligence was overwhelming :-).

  • While offering multiple algorithms to switch between is a good thing for a data compressor (which in my opinion has contributed a lot to zstandard being as highly regarded and genuinely good as it is), the current major version of bzip3 will probably remain an "analog" compressor, which fits in under a thousand lines (bar SAIS).

  • Having a very slow compression mode in a generally balanced data compressor that adds little gains and isn't data-specific is a very good method of throwing bones into the mouths of skeptics who have already made up their mind about bzip3.
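The speed-versus-ratio "knob" discussed above can be measured with a rough sketch like the one below. It is not the Lua driver used for the numbers in this thread, and it does not run bzip3 or TBWT: Python's standard `bz2` module stands in as the compressor, and the `measure` and `stage_share` helpers are hypothetical names chosen for illustration.

```python
import bz2
import time


def measure(data: bytes, level: int) -> tuple[float, float]:
    """Return (compression ratio, seconds) for one bz2 compression level."""
    start = time.perf_counter()
    packed = bz2.compress(data, compresslevel=level)
    elapsed = time.perf_counter() - start
    return len(data) / len(packed), elapsed


def stage_share(stage_seconds: float, total_seconds: float) -> float:
    """Fraction of total runtime spent in one stage (e.g. tunneling)."""
    return stage_seconds / total_seconds


if __name__ == "__main__":
    # Synthetic, highly repetitive input; a real test would use a corpus.
    sample = b"the quick brown fox jumps over the lazy dog\n" * 20000
    for level in (1, 9):
        ratio, seconds = measure(sample, level)
        print(f"level {level}: ratio {ratio:.2f}, {seconds:.4f} s")
```

Comparing the ratio gained against the extra time spent, per setting, is one way to judge whether a given turn of the knob is worth it.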

Sounds good, great that you could have a look already. Will close :)