kspalaiologos / bzip3

A better and stronger spiritual successor to BZip2.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

The CRC used in bzip3 is not the CRC-32C model

dearblue opened this issue · comments

bzip3 uses a 32-bit CRC, which is different from the CRC-32C generator polynomial 0x1edc6f41.

As far as I can see, the 32-bit CRC generator polynomial used by bzip3 is 0xf26b8303.

  • The bzip3 CRC calculation is normal in the input bit direction.
    It can be read from the fact that the input is to the upper 8 bits of the CRC and the lower 24 bits are left shifted.

    while (size--) crc = crc32Table[((crc >> 24) ^ *(buf++)) & 0xff] ^ (crc << 8);


    If the input bit direction is in reverse order, input to the lower 8 bits of the CRC and shift the upper 24 bits to the right.

    https://github.com/madler/zlib/blob/04f42ceca40f73e2978b50e93806c2a18c1281fc/crc32.c#L1062

  • When checking the generator polynomial from a table with 256 elements, if the input bit direction is normal, the element of table[1] is the generator polynomial as it is.

    0x00000000L, 0xF26B8303L, 0xE13B70F7L, 0x1350F3F4L, 0xC79A971FL, 0x35F1141CL, 0x26A1E7E8L, 0xD4CA64EBL, 0x8AD958CFL,

    The generator polynomial of bzip3 is 0xf26b8303.


    For reference, to check the generator polynomial from a table whose input bit direction is in reverse order, reverse the bit direction of the value of table[128].
    Reversing the bit direction of 0x82f63b78, the current bzip3 table[128], gives 0x1edc6f41.

    % ruby -e 'n = 0x82f63b78; g = 0; 32.times { g = (g << 1) | n[0]; n >>= 1 }; p g.to_s(16)'
    "1edc6f41"

The detectability and characteristics of CRCs vary depending on the generator polynomial.
From a survey by Prof. Koopman: https://users.ece.cmu.edu/~koopman/crc/crc32.html

I don't know how detectable the generator polynomial 0xf26b8303 of the CRC used by bzip3 is.


In my opinion, it would be better to use the tables as they are and the calculations in the reverse input bit direction.
However, for compatibility reasons, the file format signature may also need to be BZ3v2?

For information, the bzip2 CRC is normal in the input direction, and the first element of the table is 0x04c11db7.

I actually tested it with the tool provided by Prof. Koopman.

On FreeBSD 13.2 amd64, building with gcc causes a crash with SIGSEGV, so I built with clang.
However, a patch was needed.

% diff -u hdlen.cpp~ hdlen.cpp
--- hdlen.cpp~  2019-01-07 04:14:52.000000000 +0900
+++ hdlen.cpp   2023-08-01 21:36:32.487597000 +0900
@@ -134,7 +134,7 @@

 // Print all computed HD lengths, putting in "?" for uncomputed lengths
 std::ostream& operator <<(std::ostream& hout, const HDLen& hval)
-{ std::_Ios_Fmtflags current_flags = hout.flags();
+{ auto current_flags = hout.flags();

   // print HD Lengths found surrounded by '{' '}'
   char separator = '{';
@@ -219,7 +219,7 @@

 // Print out all bits in undetected codeword plus the FCS
 std::ostream& operator <<(std::ostream& uout, const UndetectedClass& uval)
-{ std::_Ios_Fmtflags current_flags = uout.flags();
+{ auto current_flags = uout.flags();
   assert(uval.uLen != unusedValue); // "Must call SetLen first"

   Count_t bitsSet = 0;

% c++ -O3 -DOPTZ hdlen.cpp -std=c++11 -o hdlen

Note that the representation of the generator polynomial passed to the hdlen command requires that the upper 32 bits be taken out of the complete 33 bits.

% ./hdlen $(printf '0x%08x\n' $((0x1f26b8303 >> 1)) )   ## CRC-32 (bzip3)
Poly=0xf935c181 startHD=3 maxHD=17
# 0xf935c181  HD=3  len=9932275  Example: Len=9932276 {0} (0x80000000) (Bits=2)
# 0xf935c181  HD=4  len=9932275  Example: Len=9932276 {0} (0x80000000) (Bits=2)
# 0xf935c181  HD=5  len=3684  Example: Len=3685 {0,1063,1932} (0x80000000) (Bits=4)
# 0xf935c181  HD=6  len=3684  Example: Len=3685 {0,1063,1932} (0x80000000) (Bits=4)
# 0xf935c181  HD=7  len=178  Example: Len=179 {0,50,83,88,105} (0x80000000) (Bits=6)
# 0xf935c181  HD=8  len=178  Example: Len=179 {0,50,83,88,105} (0x80000000) (Bits=6)
# 0xf935c181  HD=9  len=42  Example: Len=43 {0,1,8,15,16} (0x80020800) (Bits=8)
# 0xf935c181  HD=10  len=42  Example: Len=43 {0,1,8,15,16} (0x80020800) (Bits=8)
# 0xf935c181  HD=11  len=17  Example: Len=18 {0,9} (0x90174008) (Bits=10)
# 0xf935c181  HD=12  len=17  Example: Len=18 {0,9} (0x90174008) (Bits=10)
# 0xf935c181  HD=13  len=8  Example: Len=9 {0,4,7,8} (0x80890609) (Bits=12)
# 0xf935c181  HD=14  len=8  Example: Len=9 {0,4,7,8} (0x80890609) (Bits=12)
# 0xf935c181  HD=15  len=1  Example: Len=2 {0} (0x85af2141) (Bits=14)
# 0xf935c181  HD=16  len=1  Example: Len=2 {0} (0x85af2141) (Bits=14)
# 0xf935c181  HD=17  NONE  Example: Len=1 {0} (0xf935c181) (Bits=16)
0xf935c181 {9932275,9932275,3684,3684,178,178,42,42,17,17,8,8,1,1}

% ./hdlen $(printf '0x%08x\n' $((0x104c11db7 >> 1)) )   ## CRC-32
Poly=0x82608edb startHD=3 maxHD=16
...
0x82608edb {4294967263,91607,2974,268,171,91,57,34,21,12,10,10,10}

% ./hdlen $(printf '0x%08x\n' $((0x11edc6f41 >> 1)) )   ## CRC-32C
Poly=0x8f6e37a0 startHD=3 maxHD=19
...
0x8f6e37a0 {2147483615,2147483615,5243,5243,177,177,47,47,20,20,8,8,6,6,1,1}

🙀 I'm no specialist, so I couldn't read the results. 🙈

Ah, I checked the commit log and it appears that the problem is now caused by 7a8d35b.
The issue leads to #105, so I'm closing this.

I would like to add a slight addition that I noticed later.
😸 Not a bad story by any means.

The CRC used in bzip3 is not strictly speaking a CRC-32C.
I'm using the same usage as the first commit 3bf96e2 at the moment and that's my conclusion drawn from it.

% git grep -wn crc32sum 3bf96e23c7827fdb5286407dfcd29181ce40f50e
3bf96e23c7827fdb5286407dfcd29181ce40f50e:crc32.h:60:static uint32_t crc32sum(uint32_t crc, uint8_t *buf, size_t size) {
3bf96e23c7827fdb5286407dfcd29181ce40f50e:main.c:83:            uint32_t crc32 = crc32sum(1, buffer, bytes_read);
3bf96e23c7827fdb5286407dfcd29181ce40f50e:main.c:151:            if (crc32sum(1, output, bytes_read) != crc32) {

However, since the generating polynomial is the same as that of CRC-32C, I believe the detection performance is equivalent.
As a amateur's conclusion, there is nothing wrong with the current version.

A comparison with CRC-32C is shown below.

Property CRC-32C CRC-32/BZIP3 (temporary name just when here) Note
Bit width 32 32
Generator polynomial 0x1edc6f41 0x1edc6f41
Initial value 0 1
Initial internal state 0xffffffff 1 (Initial value) XOR (XOR output)
Reflect input true true
Reflect output true true
XOR output 0xffffffff 0
Result of "123456789" 0xe3069283 0xacdd2c68
Residue 0xb798b438 0
Magic number of CRC 0x48674bc7 0 (Residue) XOR (XOR output)

To repeat the amateur's conclusion, there is nothing wrong with the current version.

In case anyone is concerned, I leave the following example patch for changing to CRC-32C. I probably won't send a PR, however.

diff --git a/src/libbz3.c b/src/libbz3.c
index 15c327a..d8a521c 100644
--- a/src/libbz3.c
+++ b/src/libbz3.c
@@ -61,8 +61,9 @@ static const u32 crc32Table[256] = {
 };

 static u32 crc32sum(u32 crc, u8 * RESTRICT buf, size_t size) {
+    crc ^= 0xffffffff;
     while (size--) crc = crc32Table[((u8)crc ^ *(buf++)) & 0xff] ^ (crc >> 8);
-    return crc;
+    return crc ^ 0xffffffff;
 }

 /* LZP code. These constants were manually tuned to give the best compression ratio while using relatively
@@ -553,7 +554,7 @@ BZIP3_API s32 bz3_encode_block(struct bz3_state * state, u8 * buffer, s32 data_s
         return -1;
     }

-    u32 crc32 = crc32sum(1, b1, data_size);
+    u32 crc32 = crc32sum(0, b1, data_size);

     // Ignore small blocks. They won't benefit from the entropy coding step.
     if (data_size < 64) {
@@ -634,7 +635,7 @@ BZIP3_API s32 bz3_decode_block(struct bz3_state * state, u8 * buffer, s32 data_s

         memmove(buffer, buffer + 8, data_size - 8);

-        if (crc32sum(1, buffer, data_size - 8) != crc32) {
+        if (crc32sum(0, buffer, data_size - 8) != crc32) {
             state->last_error = BZ3_ERR_CRC;
             return -1;
         }
@@ -726,7 +727,7 @@ BZIP3_API s32 bz3_decode_block(struct bz3_state * state, u8 * buffer, s32 data_s

     if (b1 != buffer) memcpy(buffer, b1, size_src);

-    if (crc32 != crc32sum(1, buffer, size_src)) {
+    if (crc32 != crc32sum(0, buffer, size_src)) {
         state->last_error = BZ3_ERR_CRC;
         return -1;
     }