Cyan4973 / FiniteStateEntropy

New generation entropy codecs: Finite State Entropy and Huff0

Inconsistencies in stream format

klauspost opened this issue · comments

I was looking into porting the streaming format, but hit some inconsistencies.

First off, this seems like a straight-up bug: https://github.com/Cyan4973/FiniteStateEntropy/blob/dev/programs/fileio.c#L374

Note that the write is going to the wrong offsets.

Secondly, the streaming format doc states "max block size, 2^value from 0 to 0xA". This seems false, since the maximum block size is 16 bits, so the uncompressed block size can be stored in 2 bytes. Technically I guess you can use bigger values, but that would force you to emit "full sized" blocks, and decompressing them will fail this check.

Thirdly, it states in regenerated size that "0 = 64 KB". This seems misleading if I am reading the code correctly. There is no special handling for a value of 0. However, if bit 5 (full block) is set, the block size is assumed to be the size set in the stream header (1 << (n+10)). So 64 KB isn't a special value.

I will probably implement a slightly different streaming format instead, since this seems a bit too flaky and I would like to have the option of bigger blocks.

Thanks for pointing that out @klauspost .
These are great inputs.

Indeed, this format was merely created as a kind of "demo" for fse,
just to prove to external observers that it does indeed produce a compressed output from which the original can be regenerated.

To be fair, it has remained unmaintained for a long time.
Nonetheless, a "demo" should work, so I'll fix the error you reported.
I'll also fix the documentation. Indeed, the format changed a few times, and unfortunately, the code comments have not followed.

What's more difficult though is to re-introduce large blocks.
Savings beyond 64 KB are small, because the only things saved are the header and states, hence very little (roughly < 200 bytes). That's why the latest iteration settled on this limit.
But let's assume large blocks are wanted. Problem is, I'm short of a simple proposal that re-enables large blocks without complicating the format. Suggestions welcome.