format suggestions

Question

tansy opened this issue 2 years ago

Here is 'Hello World!' archive:

$ echo 'Hello World!' > hw
$ ./bzip3 -e hw hw.bzip3
$ hexdump -C hw.bzip3
00000000  42 5a 33 76 31 00 00 10  00 15 00 00 00 0d 00 00  |BZ3v1...........|
00000010  00 62 a4 78 2b ff ff ff  ff 48 65 6c 6c 6f 20 57  |.b.x+....Hello W|
00000020  6f 72 6c 64 21 0a                                 |orld!.|
00000026

Importantly, an archive should store the compressed and uncompressed lengths. They are very useful, and spending 20 bytes altogether on them is not much.
I guess the compressed block size is at offset 0x09 and the uncompressed size is at 0x0d. The thing is, they are 32-bit, which is not much these days. They should be 64-bit.

After this comes the compressed data, and there is no CRC here at all.
I understand that this is a pre-alpha phase, but you have a CRC implemented and it is nowhere to be found in the archive.

Also, these values could be stored after the (compressed) data; it depends on what you plan for the format. An example can be found here: lzip_manual.html#File-format.

Kamila Szewczyk · Answer 1 · Mon May 16 2022 13:45:46 GMT+0800 (China Standard Time)

What you are seeing is a single block, which is why it has a 32-bit compressed and decompressed size. The suffix array construction algorithm (SACA) used has a memory upper bound of about 5.5*block_size, hence the decision to limit the block size to ~512M. A compressed file consists of many such blocks.
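
As an illustration, the fields in the dump above can be decoded with a short Python sketch. The layout used here is inferred from the hexdump and this thread, not taken from the bzip3 source: a 5-byte magic, a 32-bit field at offset 0x05 (assumed to be the maximum block size), then the 32-bit compressed and decompressed sizes of the block, all assumed little-endian.

```python
import struct

# The 38 bytes shown in the hexdump of hw.bzip3.
archive = bytes.fromhex(
    "42 5a 33 76 31 00 00 10 00 15 00 00 00 0d 00 00 "
    "00 62 a4 78 2b ff ff ff ff 48 65 6c 6c 6f 20 57 "
    "6f 72 6c 64 21 0a"
)

magic = archive[0:5]

# Assumption: the field at offset 0x05 is the maximum block size.
(max_block_size,) = struct.unpack_from("<I", archive, 0x05)

# Per-block sizes at offsets 0x09 and 0x0d, as guessed in the question.
compressed_size, uncompressed_size = struct.unpack_from("<II", archive, 0x09)

print(magic)              # b'BZ3v1'
print(max_block_size)     # 1048576
print(compressed_size)    # 21
print(uncompressed_size)  # 13
```

Under these assumptions the numbers are self-consistent: the 17 header bytes plus 21 bytes of block payload give the 38-byte file, and 13 is the length of "Hello World!" plus the newline.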

> After this it's compressed data and there is no CRC here at all. I can understand that is pre alpha phase and stuff but you have CRC implemented and nowhere to be found in archive.

Because the CRC comes before the compressed data: it is the four bytes 62 a4 78 2b at offset 0x11. Please read the code before opening issues.
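
A minimal sketch of where those bytes sit in the sample file, assuming the block payload starts right after the two 32-bit size fields and begins with the 4-byte CRC. Treating the following ff ff ff ff as a marker for a block stored verbatim is an inference from the dump, not something confirmed by the source.

```python
import struct

# The 38 bytes shown in the hexdump of hw.bzip3.
archive = bytes.fromhex(
    "42 5a 33 76 31 00 00 10 00 15 00 00 00 0d 00 00 "
    "00 62 a4 78 2b ff ff ff ff 48 65 6c 6c 6f 20 57 "
    "6f 72 6c 64 21 0a"
)

compressed_size, uncompressed_size = struct.unpack_from("<II", archive, 0x09)

# Everything counted by compressed_size, starting at offset 0x11.
payload = archive[0x11:0x11 + compressed_size]

crc_bytes = payload[0:4]  # the CRC mentioned in the answer
marker = payload[4:8]     # assumption: flags a block stored uncompressed
data = payload[8:]        # the remaining bytes of the block

print(crc_bytes.hex(" "))  # 62 a4 78 2b
print(data)                # b'Hello World!\n'
```

Here the 21-byte compressed size breaks down as 4 bytes of CRC, 4 bytes of marker, and 13 bytes of data.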

> Also these values may be after (compressed) data.

And how do you determine the length of the compressed data that you have to read from a file if it's saved after this data? Length-terminated strings don't work for the same reason.
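
The point can be sketched with a generic length-prefixed stream reader. This is not bzip3's actual reader; the helper names `write_block` and `read_blocks` and the 4-byte little-endian prefix are assumptions made for illustration. Because each length comes first, the reader always knows how many bytes to consume before it reaches the next block.

```python
import io
import struct


def write_block(out, payload):
    """Write a 4-byte little-endian length, then the payload itself."""
    out.write(struct.pack("<I", len(payload)))
    out.write(payload)


def read_blocks(stream):
    """Read length-prefixed blocks until the stream is exhausted."""
    blocks = []
    while True:
        header = stream.read(4)
        if not header:
            break  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated length field")
        (size,) = struct.unpack("<I", header)
        payload = stream.read(size)
        if len(payload) != size:
            raise ValueError("truncated block")
        blocks.append(payload)
    return blocks


buf = io.BytesIO()
for chunk in (b"first block", b"", b"third"):
    write_block(buf, chunk)

buf.seek(0)
blocks = read_blocks(buf)
print(blocks)  # [b'first block', b'', b'third']
```

If the length were written after the payload instead, `read_blocks` would have no way to know where the first payload ends without seeking to the end of the file or scanning for a delimiter.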