kspalaiologos / bzip3

A better and stronger spiritual successor to BZip2.

bz3_compress() error in compression

tansy opened this issue · comments

I tried to use the library function bz3_compress() to compress a buffer, and the resulting file turned out to be corrupted.

As it turned out, it was 4 bytes longer than the bzip3-compressed file. So I checked a hexdump and, sure enough, there is this "string", or the int 0x01000000, inserted after the block size.

--- README.md.bz3.hxd	2023-10-14 20:49:50.414945393 +0000
+++ README.md.bz3-lib.hxd	2023-10-14 20:47:18.278847356 +0000
@@ -1,402 +1,402 @@
-00000000  42 5a 33 76 31 00 00 10  00 fa 18 00 00 79 55 00  |BZ3v1........yU.|
-00000010  00 81 eb df 64 01 00 00  00 06 f6 54 00 00 99 55  |....d......T...U|
-00000020  00 00 f5 ff ff ff ff ff  ff ff fd 22 08 8e 06 65  |..........."...e|
-00000030  cc 2d f1 3b c7 e8 4b d4  7b 67 c3 3c d9 03 64 dc  |.-.;..K.{g.<..d.|
-00000040  65 ed de cf dc d8 90 e7  97 c2 9d 03 b4 fb 32 80  |e.............2.|
-00000050  23 00 e4 bf ea b5 32 ad  a2 41 8a a3 21 b8 81 5f  |#.....2..A..!.._|

+00000000  42 5a 33 76 31 00 04 01  00 01 00 00 00 fa 18 00  |BZ3v1...........|
+00000010  00 79 55 00 00 81 eb df  64 01 00 00 00 06 f6 54  |.yU.....d......T|
+00000020  00 00 99 55 00 00 f5 ff  ff ff ff ff ff ff fd 22  |...U..........."|
+00000030  08 8e 06 65 cc 2d f1 3b  c7 e8 4b d4 7b 67 c3 3c  |...e.-.;..K.{g.<|
+00000040  d9 03 64 dc 65 ed de cf  dc d8 90 e7 97 c2 9d 03  |..d.e...........|
+00000050  b4 fb 32 80 23 00 e4 bf  ea b5 32 ad a2 41 8a a3  |..2.#.....2..A..|

After removing these 4 bytes the file decompresses fine.
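A sketch of that removal (the 9-byte prefix being the "BZ3v1" magic plus the 32-bit block size, as seen in the dumps; file names are placeholders):

#include <stdio.h>

int main(void) {
    FILE *in = fopen("README.md.bz3-lib", "rb");  /* bz3_compress() output */
    FILE *out = fopen("README.md.bz3", "wb");     /* CLI-style stream */
    if (!in || !out) return 1;

    /* Copy the 9-byte header: "BZ3v1" magic + 32-bit block size. */
    unsigned char header[9];
    if (fread(header, 1, sizeof header, in) != sizeof header) return 1;
    fwrite(header, 1, sizeof header, out);

    /* Skip the 4 extra bytes (01 00 00 00 in the dump above). */
    if (fseek(in, 4, SEEK_CUR) != 0) return 1;

    /* Copy the rest unchanged. */
    unsigned char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        fwrite(buf, 1, n, out);

    fclose(out);
    fclose(in);
    return 0;
}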

You can have a look at it here. If there is something wrong on my part, let me know.

Sorry, I don't have the software required to decompress this file installed on my computer.

A wild guess: You made some data using bz3_compress and expected the CLI to be able to unpack this.

This is a fundamental misunderstanding. bz3_compress uses a different protocol than the CLI, because bz3_compress is told the size of the data upfront (hence it can put it in the header), while the CLI may not know it (because you may have decided to, e.g., pipe the data through stdin).
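A minimal sketch of the intended round trip, using the high-level API from libbz3.h (error handling reduced to asserts for brevity):

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <libbz3.h>

int main(void) {
    const uint8_t in[] = "hello, bzip3";
    size_t in_size = sizeof(in);

    /* bz3_bound() returns the worst-case compressed size. */
    size_t comp_size = bz3_bound(in_size);
    uint8_t *comp = malloc(comp_size);
    assert(comp);

    /* The frame written here carries extra framing information up
       front, which is exactly what the CLI stream omits. */
    int rc = bz3_compress(1 << 20, in, comp, in_size, &comp_size);
    assert(rc == BZ3_OK);

    /* Data framed by bz3_compress is meant to be unpacked by
       bz3_decompress, not by the CLI. */
    size_t out_size = in_size;
    uint8_t *out = malloc(out_size);
    assert(out);
    rc = bz3_decompress(comp, out, comp_size, &out_size);
    assert(rc == BZ3_OK);
    assert(out_size == in_size && memcmp(out, in, in_size) == 0);

    free(comp);
    free(out);
    return 0;
}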

I don't have the software required to decompress that

I didn't expect that. I thought everyone knew lzip.

Or here (bzip2).

A wild guess: You made some data using bz3_compress and expected the CLI to be able to unpack this.

Obviously. How else could I test whether the result is correct? By decompressing the compressed file.

Everyone, even bzip2, does that.

This is a fundamental misunderstanding. bz3_compress uses a different protocol than the CLI, because bz3_compress is told the size of the data upfront (hence it can put it in the header), while the CLI may not know it (because you may have decided to, e.g., pipe the data through stdin).

If it's the length of the uncompressed data, then it would be 21881, not 1.

BTW, you may know what is wrong here, as trying to use it as a plugin I got a whole bunch of linker errors:

bzip3/src/libbz3.o: In function `bz3_last_error':
libbz3.c:(.text+0x6e30): multiple definition of `bz3_last_error'
bzip3/src/libbz3.o:libbz3.c:(.text+0x6e30): first defined here

(...)
bzip3/src/libbz3.o:libbz3.c:(.text+0xa900): first defined here
_lzbench/compressors.o: In function `lzbench_bzip3_compress(char*, unsigned int, char*, unsigned int, unsigned int, unsigned int, char*)':
compressors.cpp:(.text+0x405): undefined reference to `bz3_compress(unsigned int, unsigned char const*, unsigned char*, unsigned int, unsigned int*)'

You may know what the problem is here.
There is this script, which is fully automatic and will download and patch lzbench. Long story short: during linking I get the above errors, which I don't understand. Maybe you will.

bzip3/src/libbz3.o:libbz3.c:(.text+0xa900): first defined here
_lzbench/compressors.o: In function `lzbench_bzip3_compress(char*, unsigned int, char*, unsigned int, unsigned int, unsigned int, char*)':
compressors.cpp:(.text+0x405): undefined reference to `bz3_compress(unsigned int, unsigned char const*, unsigned char*, unsigned int, unsigned int*)'

This is a symptom of C++ with no C linkage.
I have submitted #117 as a suggested fix.
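For reference, the usual shape of such a fix is to wrap the public declarations so that a C++ translation unit sees them with C linkage instead of mangling the symbol names; a sketch (the actual patch is in #117):

/* In the public header (e.g. libbz3.h): */
#ifdef __cplusplus
extern "C" {
#endif

int bz3_compress(uint32_t block_size, const uint8_t *in, uint8_t *out,
                 size_t in_size, size_t *out_size);
/* ... the remaining declarations ... */

#ifdef __cplusplus
}
#endif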

If it's the length of the uncompressed data, then it would be 21881, not 1.

Well, obviously it's the number that the CLI doesn't store: the count of compressed blocks.
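That is, reading the dumps above (this layout is inferred here, not taken from file_format.md):

CLI stream:     "BZ3v1" | u32le block_size                  | blocks...
library frame:  "BZ3v1" | u32le block_size | u32le n_blocks | blocks...

In the library dump, n_blocks is 01 00 00 00 = 1 (the whole 21881-byte input fits in one block), while the 79 55 00 00 further on appears to be 0x5579 = 21881, the per-block original size.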

Well, obviously it's the number that the CLI doesn't store: the count of compressed blocks.

Well, obviously it is not stated in `file_format.md`. It looks like it was written in a hurry.
I wish it looked like this.

If it is an archive format, then they should not be different. The in-memory compressor should follow the CLI format, or the other way around, but they should be the same.

Well, obviously it is not stated in `file_format.md`. It looks like it was written in a hurry.

Kind of. Welcome to the documentation of a project with only one person making meaningful source/doc contributions while balancing work, private life, university, and a bunch of other things.

As I have stated in a different ticket opened by someone else: If you want to dig through the code and improve the documentation, I can proofread it and merge it into the source trunk.

Remember that I am not paid for any of this :-). I get no benefit from maintaining it, and I do so in my free time.

project with only one person making

You want a medal for it? I'll give you one.

Lzip is also done by one person, as are some other related and unrelated projects.

As I have stated in a different ticket opened by someone else: If you want to dig through the code and improve the documentation, I can proofread it and merge it into the source trunk.

This little thing can be done only by you. Who knows the format and the program better than their creator?
But I'll give you a head start: there is a more RFC-like file format documentation in #118, though it's not even the subject of this issue.
The subject, although it "evolved", or turned out, shall I say, is the format and why it is "fuzzy" and different in different cases. That is what I want to address, but before you do, please read this chapter and you will know my standpoint. As a person who has recovered many damaged archives, I have to say it is an important question.