hardware crc32

Question

rurban opened this issue 2 years ago

For performance reasons, you should probe for hardware CRC32 support (most chips have it) and use it instead of the software variant. There are several variants (PCLMUL on x86, or just the crc32 intrinsics).
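As a rough illustration of what "probing" could mean, here is a minimal sketch that looks for CRC-related CPU flags in Linux's `/proc/cpuinfo`. It is an assumption-laden example, not bzip3 code: the helper names `crc_cpu_flags` and `probe` are hypothetical, the approach is Linux-only, and a C library would more likely query CPUID or the auxiliary vector directly.

```python
# Hypothetical helpers for illustration only; nothing here is part of bzip3.
# Flag names as they appear in /proc/cpuinfo on Linux:
#   x86:     "sse4_2" (crc32 instruction), "pclmulqdq" (carry-less multiply)
#   aarch64: "crc32" (CRC extension), "pmull" (polynomial multiply)
CRC_FLAGS = {"sse4_2", "pclmulqdq", "crc32", "pmull"}


def crc_cpu_flags(cpuinfo_text):
    """Return the CRC-related flags found in the given cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 lists flags under "flags", aarch64 under "Features".
        if sep and key.strip().lower() in ("flags", "features"):
            found |= CRC_FLAGS & set(value.split())
    return found


def probe(path="/proc/cpuinfo"):
    """Read cpuinfo from disk; return an empty set if it is unavailable."""
    try:
        with open(path) as f:
            return crc_cpu_flags(f.read())
    except OSError:
        return set()
```

A caller would fall back to the software CRC whenever `probe()` returns an empty set.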

Kamila Szewczyk · Answer 1 · Fri May 13 2022 14:37:41 GMT+0800 (China Standard Time)

CRC32 takes approximately 1.1% of the runtime. The timing on the Silesia corpus is 17.42s, so CRC32 took an absolutely astonishing amount of time - 170 milliseconds.

Hardware CRC32 also uses obscure polynomials, so it isn't portable (one architecture has hardware CRC32 with polynomial 1, another has hardware CRC32 with polynomial 2).
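To illustrate why the polynomial matters, the sketch below computes a bitwise CRC with two common 32-bit polynomials: the IEEE one (used by zlib) and the Castagnoli one (CRC-32C, the one implemented by the x86 SSE4.2 `crc32` instruction). The function name `crc32_bitwise` is hypothetical, and this example makes no claim about which polynomial bzip3 itself uses.

```python
import zlib

# Reflected (LSB-first) forms of the two polynomials.
CRC32_IEEE = 0xEDB88320  # zlib, gzip, PNG
CRC32C = 0x82F63B78      # Castagnoli, x86 SSE4.2 crc32 instruction


def crc32_bitwise(data, poly):
    """Slow reference CRC: one bit at a time, reflected, init and xor-out all ones."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (poly if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


msg = b"123456789"
ieee = crc32_bitwise(msg, CRC32_IEEE)
cast = crc32_bitwise(msg, CRC32C)
print(hex(ieee), hex(cast))          # two different checksums for the same input
print(ieee == zlib.crc32(msg))       # the IEEE variant matches zlib
```

The same input yields different checksums, so a hardware instruction helps only if it implements the polynomial the file format already uses.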

The optimisation is mostly irrelevant, and the optimisation effort would be better spent on e.g. src/cm.c, which could really use some help now, as it probably accounts for more than 90% of the runtime.
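The argument above is essentially Amdahl's law. A minimal sketch, using the 1.1% and 90% shares quoted in this thread; the local speedup factors (infinite and 2x) are assumptions chosen purely for illustration, and `overall_speedup` is a hypothetical helper name.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of the runtime
    is made `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# CRC32 at ~1.1% of runtime: even an infinitely fast CRC gains about 1.011x.
print(overall_speedup(0.011, float("inf")))

# A part taking ~90% of runtime: a mere 2x local speedup gains about 1.82x.
print(overall_speedup(0.90, 2.0))
```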

Reini Urban · Answer 2 · Fri May 13 2022 14:56:23 GMT+0800 (China Standard Time)

Oh nice. I thought it took more than that. So closing. The CRC32C polynomial is the usual one.

synodriver · Answer 3 · Mon Aug 29 2022 16:36:57 GMT+0800 (China Standard Time)

Actually, I did some profiling of the call stack. Although the call comes from Python, the native stack did show something. See bz3_encode_block and the frames below it.

Kamila Szewczyk · Answer 4 · Mon Aug 29 2022 20:37:10 GMT+0800 (China Standard Time)

What was the testing corpus? Most of the data must have been collapsed by RLE, making the graph look as if arithmetic coding, SAIS construction, etc. take little to no time, which can't be true.

ps: i just came home from surgery, i will be less responsive for a while wrt. OSS.

synodriver · Answer 5 · Tue Aug 30 2022 01:03:11 GMT+0800 (China Standard Time)

I tested it with some repeated data generated in Python. The profiler I'm using is py-spy, which can profile native extensions. I was also surprised by the result.

Kamila Szewczyk · Answer 6 · Thu Sep 01 2022 02:10:37 GMT+0800 (China Standard Time)

Your benchmark must be wrong. Try bzip3 on more real-world data, where RLE doesn't save this much space.
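A small sketch of why repeated input skews such a profile: counting byte runs (what a simple run-length encoder would emit) shows a repeated buffer collapsing to almost nothing, while noisy data barely shrinks. This is a generic byte-level RLE model, not bzip3's actual RLE stage, and the test inputs and the name `count_runs` are assumptions made for illustration.

```python
import random
from itertools import groupby


def count_runs(data):
    """Number of (byte, length) pairs a simple byte-level RLE would emit."""
    return sum(1 for _ in groupby(data))


n = 10_000
repeated = b"a" * n                      # synthetic "repeated data"
rng = random.Random(0)                   # fixed seed for reproducibility
noisy = bytes(rng.randrange(256) for _ in range(n))

print(count_runs(repeated))  # a single run: later stages see almost no input
print(count_runs(noisy))     # close to n: later stages do nearly all the work
```

With the repeated buffer, everything after the RLE step has next to nothing to process, so fixed per-byte costs such as the checksum dominate the profile.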