Cyan4973 / FiniteStateEntropy

New generation entropy codecs : Finite State Entropy and Huff0

Comparison to arithmetic coding?

MarcusJohnson91 opened this issue · comments

I know the wiki says that the performance is similar, but can we get a benchmark comparing processing time and compression ratio, to know exactly how well it performs compared to its closest competition?

The problem is to find a relevant competitor.
There are many different arithmetic coders out there, but selecting one of them as "representative" could raise suspicion that it was an easy target to beat. So it would need to be a renowned version.

In contrast with Huffman, which has an excellent and widely acknowledged implementation within zlib, I don't know yet of an arithmetic coding implementation with equivalent status. Maybe within lzma, although my understanding is that lzma is limited to a binary arithmetic coder, a variant which does not compete with FSE since it's limited to 2 symbols (yes/no) per operation.
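For reference, this is roughly what one step of such a binary coder looks like. It's a schematic sketch in the LZMA style with names of my own choosing; carry propagation and byte output are omitted, so it's an outline rather than a working coder. The point is that each call resolves exactly one yes/no decision, whereas FSE emits one symbol from a full multi-symbol alphabet per step.

```c
#include <stdint.h>

/* Schematic LZMA-style binary range coder: one call encodes a single
 * yes/no decision against an adaptive probability. Carry propagation
 * and byte output are omitted. */
typedef struct { uint64_t low; uint32_t range; } BitEncoder;

#define PROB_BITS   11   /* probabilities scaled to 1 << 11, as in LZMA */
#define ADAPT_SHIFT 5

static void encode_bit(BitEncoder *rc, uint16_t *prob, int bit)
{
    const uint32_t bound = (rc->range >> PROB_BITS) * (*prob);
    if (bit == 0) {
        rc->range = bound;
        *prob += (uint16_t)(((1u << PROB_BITS) - *prob) >> ADAPT_SHIFT);  /* adapt towards 0 */
    } else {
        rc->low   += bound;
        rc->range -= bound;
        *prob -= (uint16_t)(*prob >> ADAPT_SHIFT);                        /* adapt towards 1 */
    }
    while (rc->range < (1u << 24)) {  /* renormalize; a real coder flushes a byte here */
        rc->range <<= 8;
        rc->low   <<= 8;
    }
}
```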

Another possibility could be to provide the Shannon limit, which is the theoretical maximum compression ratio, thus showing there is almost nothing left to gain. But it wouldn't help to compare speed.
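As an illustration of what that comparison would involve, here is a minimal sketch (not part of the library; the function name is mine) of computing the order-0 Shannon limit for a block of bytes, which could then be set against the FSE-compressed size minus headers:

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Order-0 Shannon limit, in bytes, for a block of data: the sum over symbols
 * of count[s] * -log2(p[s]), converted from bits to bytes. Comparing this to
 * the compressed size (minus headers) gives the distance from optimal. */
static double shannon_limit_bytes(const uint8_t *src, size_t srcSize)
{
    size_t count[256] = {0};
    double bits = 0.0;
    for (size_t i = 0; i < srcSize; i++)
        count[src[i]]++;
    for (int s = 0; s < 256; s++) {
        if (count[s] == 0) continue;
        const double p = (double)count[s] / (double)srcSize;
        bits += (double)count[s] * -log2(p);
    }
    return bits / 8.0;
}
```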

A comparison to the Shannon entropy limit would be better than nothing; any idea what the rough numbers are offhand?

I haven't calculated them precisely, but I suspect FSE to be fairly close to the limit, likely < 0.1%, although without counting headers (which account for the majority of the difference). Note though that the header issue would be the same for any arithmetic coder.

Wow, I was expecting at least 5%, and Huffman is 12.5% minimum (not counting headers). That's crazy good.

bumblebritches57, the closeness to the Shannon entropy depends on the probability distribution - e.g. Huffman is perfect when all probabilities are powers of 1/2.
Like arithmetic coding, ANS can get as close to Shannon as we want for any probability distribution - the (KL) distance depends on the parameters. The tANS variant used in FSE is defined by a symbol spread: filling a table of length L (2048 in FSE) with appearances of symbols, in proportions corresponding to the probability distribution. You can get Huffman this way by spreading symbols in ranges whose lengths are powers of 2, so tANS can be seen as an extension of Huffman.
The larger L, the closer to Shannon - the distance generally decreases like 1/L^2: double L in FSE to get 4 times closer, but the tables might no longer fit in L1 cache (speed). For LZMA-like compression, the somewhat more accurate rANS variant is used - it requires multiplication (in contrast to tANS), but allows slightly better memory tradeoffs and is better suited to dynamic modification of probabilities in adaptive compression.
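To make the "symbol spread" concrete, here is a minimal sketch of a proportional spread, assuming L is a power of two and the normalized counts sum to L; it uses the same stride as FSE's table construction but leaves out FSE's special handling of low-probability symbols, and the function name is mine:

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a table of length L with symbol values so that symbol s occupies
 * normCount[s] slots. Assumes L is a power of two and the counts sum to L.
 * Stepping with a stride coprime to L scatters each symbol across the table;
 * placing each symbol in one contiguous power-of-two-length run instead
 * would reproduce plain Huffman coding. */
static void spread_symbols(uint8_t *table, size_t L,
                           const unsigned *normCount, unsigned maxSymbol)
{
    const size_t step = (L >> 1) + (L >> 3) + 3;  /* odd for L >= 16, hence coprime with L */
    size_t pos = 0;
    for (unsigned s = 0; s <= maxSymbol; s++) {
        for (unsigned n = 0; n < normCount[s]; n++) {
            table[pos] = (uint8_t)s;
            pos = (pos + step) & (L - 1);
        }
    }
}
```

With L = 2048 as in FSE, each symbol ends up scattered roughly uniformly across the table, which is what keeps the per-symbol coding cost close to the ideal -log2(p) bits.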
If you want to test distance from Shannon of tANS for various parameters and symbol spreads: https://github.com/JarekDuda/AsymmetricNumeralSystemsToolkit
Some benchmarks of entropy coders:
http://encode.ru/threads/1920-In-memory-open-source-benchmark-of-entropy-coders?p=45168&viewfull=1#post45168
https://sites.google.com/site/powturbo/entropy-coder

Thanks, I'll check it out.

You could also compare against static block-based arithmetic compression. I optimised one of those (unrolling and interleaving Eugene Shelwien's) just before Ryg posted his rANS, and then gave the rANS the same treatment as a side-by-side comparison. Both are beaten by FSE, but due to the interleaving and static nature I think this is one of the fastest arithmetic (range) coders out there, and as such it targets the same job as FSE.

See arith_static.c in https://github.com/jkbonfield/rans_static or look at my benchmarks, although the data may not be so obvious.