Cyan4973 / FiniteStateEntropy

New generation entropy codecs : Finite State Entropy and Huff0

Comparison to arithmetic coding?

MarcusJohnson91 opened this issue · comments

I know the wiki says that the performance is similar, but can we get a benchmark comparing processing time and compression ratio, to know exactly how well it performs compared to its closest competition?

The problem is to find a relevant competitor.
There are many different arithmetic coders out there, but selecting one of them as "representative" could raise suspicion that it was an easy target to beat. So it would need to be a renowned version.

In contrast with Huffman, which has an excellent and widely acknowledged implementation within zlib, I don't know yet of an arithmetic coding implementation with equivalent status. Maybe within lzma, although my understanding is that lzma is limited to a binary arithmetic coder, a variant which does not compete with FSE since it's limited to 2 symbols (yes/no) per operation.
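For reference, this is roughly what one step of such a binary coder looks like. It's a schematic sketch in the LZMA style with names of my own choosing; carry propagation and byte output are omitted, so it's an outline rather than a working coder. The point is that each call resolves exactly one yes/no decision, whereas FSE emits one symbol from a full multi-symbol alphabet per step.

```c
#include <stdint.h>

/* Schematic LZMA-style binary range coder: one call encodes a single
 * yes/no decision against an adaptive probability. Carry propagation
 * and byte output are omitted. */
typedef struct { uint64_t low; uint32_t range; } BitEncoder;

#define PROB_BITS   11   /* probabilities scaled to 1 << 11, as in LZMA */
#define ADAPT_SHIFT 5

static void encode_bit(BitEncoder *rc, uint16_t *prob, int bit)
{
    const uint32_t bound = (rc->range >> PROB_BITS) * (*prob);
    if (bit == 0) {
        rc->range = bound;
        *prob += (uint16_t)(((1u << PROB_BITS) - *prob) >> ADAPT_SHIFT);  /* adapt towards 0 */
    } else {
        rc->low   += bound;
        rc->range -= bound;
        *prob -= (uint16_t)(*prob >> ADAPT_SHIFT);                        /* adapt towards 1 */
    }
    while (rc->range < (1u << 24)) {  /* renormalize; a real coder flushes a byte here */
        rc->range <<= 8;
        rc->low   <<= 8;
    }
}
```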

Another possibility could be to provide the Shannon limit, which is the theoretical maximum compression ratio, thus showing there is almost nothing left to gain. But it wouldn't help to compare speed.
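As an illustration of what that comparison would involve, here is a minimal sketch (not part of the library; the function name is mine) of computing the order-0 Shannon limit for a block of bytes, which could then be set against the FSE-compressed size minus headers:

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Order-0 Shannon limit, in bytes, for a block of data: the sum over symbols
 * of count[s] * -log2(p[s]), converted from bits to bytes. Comparing this to
 * the compressed size (minus headers) gives the distance from optimal. */
static double shannon_limit_bytes(const uint8_t *src, size_t srcSize)
{
    size_t count[256] = {0};
    double bits = 0.0;
    for (size_t i = 0; i < srcSize; i++)
        count[src[i]]++;
    for (int s = 0; s < 256; s++) {
        if (count[s] == 0) continue;
        const double p = (double)count[s] / (double)srcSize;
        bits += (double)count[s] * -log2(p);
    }
    return bits / 8.0;
}
```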

A comparison to the Shannon entropy limit would be better than nothing; any idea what the rough numbers are offhand?

I haven't calculated them precisely, but I suspect FSE to be fairly close to the limit, likely < 0.1%, although without counting headers (which account for the majority of the difference). Note though that the header issue would be the same for any arithmetic coder.

Wow, I was expecting at least 5%, and Huffman is 12.5% minimum (not counting headers). That's crazy good.

bumblebritches57, the closeness to the Shannon entropy depends on the probability distribution - e.g. Huffman is perfect when all probabilities are powers of 1/2.
Like arithmetic coding, ANS can get as close to Shannon as we want for any probability distribution - the (KL) distance depends on the parameters. The tANS variant used in FSE is defined by a symbol spread: filling a table of length L (2048 in FSE) with appearances of symbols, in proportions corresponding to the probability distribution. You can get Huffman this way by spreading symbols in ranges whose lengths are powers of 2, so tANS can be seen as an extension of Huffman.
The larger L, the closer to Shannon - the distance generally decreases like 1/L^2: double L in FSE to get 4 times closer, but the tables might no longer fit in L1 cache (speed). For LZMA-like compression, the somewhat more accurate rANS variant is used - it requires multiplication (in contrast to tANS), but allows slightly better memory tradeoffs and is better suited to dynamic modification of probabilities in adaptive compression.
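To make the "symbol spread" concrete, here is a minimal sketch of a proportional spread, assuming L is a power of two and the normalized counts sum to L; it uses the same stride as FSE's table construction but leaves out FSE's special handling of low-probability symbols, and the function name is mine:

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a table of length L with symbol values so that symbol s occupies
 * normCount[s] slots. Assumes L is a power of two and the counts sum to L.
 * Stepping with a stride coprime to L scatters each symbol across the table;
 * placing each symbol in one contiguous power-of-two-length run instead
 * would reproduce plain Huffman coding. */
static void spread_symbols(uint8_t *table, size_t L,
                           const unsigned *normCount, unsigned maxSymbol)
{
    const size_t step = (L >> 1) + (L >> 3) + 3;  /* odd for L >= 16, hence coprime with L */
    size_t pos = 0;
    for (unsigned s = 0; s <= maxSymbol; s++) {
        for (unsigned n = 0; n < normCount[s]; n++) {
            table[pos] = (uint8_t)s;
            pos = (pos + step) & (L - 1);
        }
    }
}
```

With L = 2048 as in FSE, each symbol ends up scattered roughly uniformly across the table, which is what keeps the per-symbol coding cost close to the ideal -log2(p) bits.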
If you want to test distance from Shannon of tANS for various parameters and symbol spreads: https://github.com/JarekDuda/AsymmetricNumeralSystemsToolkit
Some benchmarks of entropy coders:
http://encode.ru/threads/1920-In-memory-open-source-benchmark-of-entropy-coders?p=45168&viewfull=1#post45168
https://sites.google.com/site/powturbo/entropy-coder

Thanks, I'll check it out.

You could also compare against static block-based arithmetic compression. I optimised one of those (unrolling and interleaving Eugene Shelwien's) just before Ryg posted his rANS, and then gave the rANS the same treatment as a side-by-side comparison. Both are beaten by FSE, but due to the interleaving and static nature I think this is one of the fastest arithmetic (range) coders out there, and as such it targets the same job as FSE.

See arith_static.c in https://github.com/jkbonfield/rans_static or look at my benchmarks, although the data may not be so obvious.