CRC-32 used as undocumented default
todd-a-jacobs opened this issue
I have a job running using the following syntax:
zpaqfranz a foo.zpaq foo -m5 -verbose -xxh3 -pakka
The running job is reporting:
Integrity check type: XXH3+CRC-32
The use of CRC-32 isn't specified on the command line, and seems to occur whether or not I specify a specific xxhash or chunked format. For example, leaving off -xxh3 -pakka
just results in the output line changing to:
Integrity check type: XXHASH64+CRC-32
instead, which is not what the documentation seems to define as the default either. While I can see why -xxhash
would default to xxhash64 on a 64-bit system, I'm not sure why CRC-32 is being calculated or why it is a default, especially on a 64-bit system where 32 bits would seem to invite collisions.
If you just want a fast default to add, why not use MD5, which (while cryptographically weak) is at least 128 bits? This seems like either an error in the documentation, an error in the defaults, or a sub-optimal choice for a fast and well-supported checksum.
One of the main differences between zpaqfranz and zpaq is the existence of whole-file checksums (in fact there are even bigger differences in the new tar-like format, still to be completed)
In zpaq there is no checksum (or hash) of a file as a whole, only the SHA-1 of its individual fragments:
https://encode.su/threads/3508-Compute-overall-SHA1-from-a-SHA1-series-of-fragments
Sadly this means that any SHA-1 collision (there are famous colliding PDF files in this respect) is NOT intercepted by zpaq.
Short version: if you archive two files with a SHA-1 collision, zpaq doesn't complain:
https://encode.su/threads/3658-How-big-can-the-hash-slowdown-in-an-archiver-be-tolerable/page2?highlight=sha1+collision
I can give a deeper explanation if you want.
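To see why fragment-only hashing is risky, here is a toy sketch (my illustration, not zpaq's actual code) of a content-addressed store keyed solely by fragment SHA-1. Any two colliding fragments, like the blocks of the famous shattered.io PDFs, would silently deduplicate to a single stored content:

```python
import hashlib

store = {}  # toy content-addressed fragment store, keyed by SHA-1 only

def put(fragment: bytes) -> str:
    """Store a fragment under its SHA-1 digest; return the key."""
    key = hashlib.sha1(fragment).hexdigest()
    # If two DIFFERENT fragments collide on SHA-1, the second one is
    # silently dropped here and both files "share" the first fragment.
    store.setdefault(key, fragment)
    return key

k1 = put(b"fragment A")
k2 = put(b"fragment B")
assert k1 != k2 and len(store) == 2  # distinct fragments stored separately
```

With only per-fragment hashes, nothing downstream can notice the swap; a whole-file checksum computed from the original bytes is what catches it.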
CRC-32 has one major difference from 'cryptographic' hashes (including MD5): it can be computed over out-of-order portions and the partial results combined (aka: fragments).
https://encode.su/threads/3658-How-big-can-the-hash-slowdown-in-an-archiver-be-tolerable?highlight=sha1+collision
Again, you will find all the details on the development forum, or I can write them here.
The short version is that zpaqfranz calculates the CRC-32 during the test phase (the t command) with minimal performance impact, and compares it with the CRC-32 calculated during the compression phase (in short, the one derived from the file as it was read).
https://encode.su/threads/3543-How-to-quickly-compute-CRC-32-of-an-all-zeroed-buffer
You really cannot do this with deduplication enabled (that's why there is the w command) using a different hasher (MD5 or whatever).
It is simply impossible, AND you cannot compute it multithreaded, only in a monotonic single-threaded run (this is the p command).
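The combining property can be sketched with Python's `zlib.crc32` (a sketch of the underlying math, not zpaqfranz's implementation). CRC-32 is an affine function of the message bits over GF(2), so for equal-length messages `crc(x ^ y) == crc(x) ^ crc(y) ^ crc(zeros)`, which lets the CRC of a whole file be rebuilt from per-fragment CRCs:

```python
import zlib

a, b = b"fragment-one", b"fragment-two!"
za, zb = b"\x00" * len(a), b"\x00" * len(b)

# CRC-32 is affine over GF(2): for equal-length messages x and y,
#   crc(x ^ y) == crc(x) ^ crc(y) ^ crc(all-zeros)
# so the CRC of the concatenation a+b can be rebuilt from pieces:
combined = (zlib.crc32(a + zb)      # a, padded with zeros on the right
            ^ zlib.crc32(za + b)    # b, shifted right by len(a) zeros
            ^ zlib.crc32(za + zb))  # correction term: all zeros
assert combined == zlib.crc32(a + b)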
The net result is that SHA-1 collisions are detected by zpaqfranz (not corrected, just detected).
'Hidden' changes inside the data will be detected too (e.g. archiving an in-use file that 'someone' changes underneath, like a running VM).
This is because zpaq(franz) tries to archive practically everything it can, whether it is in use or not (after all, it is software designed for backup, unlike other compressors).
Obviously the probability is small, but it still exists.
Example
P:\vm>zpaqfranz t sift.zpaq
zpaqfranz v58.4f-JIT-LBLAKE3,SFX64 v55.1,(2023-05-21)
sift.zpaq:
1 versions, 9 files, 206.441 frags, 919 blks, 5.274.750.028 bytes (4.91 GB)
To be checked 17.286.397.757 in 8 files (32 threads)
7.15 stage time 21.38 no error detected (RAM ~514.07 MB), try CRC-32 (if any)
Checking 18.315 blocks with CRC-32 (16.713.549.149 not-0 bytes)
ERROR: STORED CRC-32 2379B9B9 != DECOMPRESSED B67F2325 (ck 00008946) sift/SIFT-Workstation-disk1.vmdk
CRC-32 time 0.45s
Blocks 16.713.549.149 ( 18.315)
Zeros negative ( 2.352) 0.156000 s
Total 14.439.568.073 speed 31.735.314.446/sec (29.56 GB/s)
ERRORS : 00000001 (ERROR in rebuilded CRC-32, SHA-1 collisions?)
--------------------------------------------------------------------------------------
GOOD : 00000007 of 00000008 (stored=decompressed)
WITH ERRORS
21.875 seconds (000:00:21) (with warnings)
7.15 stage time 21.38 no error detected (RAM ~514.07 MB), try CRC-32 (if any)
This is the first, zpaq-based test stage.
After that, zpaqfranz's own CRC-32 check kicks in (if CRC data is present).
In the above example the archive itself is good (it is extractable), but the archived data is somewhat different from the source.
In this case everything is OK
P:\backup>zpaqfranz t www.zpaq
zpaqfranz v58.4f-JIT-LBLAKE3,SFX64 v55.1,(2023-05-21)
www.zpaq:
1 versions, 2.731 files, 106.338 frags, 523 blks, 6.966.672.085 bytes (6.49 GB)
To be checked 7.378.295.954 in 2.461 files (32 threads)
7.15 stage time 25.17 no error detected (RAM ~514.07 MB), try CRC-32 (if any)
Checking 3.487 blocks with CRC-32 (7.378.213.268 not-0 bytes)
Block 00002K 6.11 GB
CRC-32 time 0.05s
Blocks 7.378.213.268 ( 3.487)
Zeros 82.686 ( 1) 0.000000 s
Total 7.378.295.954 speed 153.714.499.041/sec (143.16 GB/s)
GOOD : 00002461 of 00002461 (stored=decompressed)
VERDICT : OK (CRC-32 stored vs decompressed)
Computing the CRC-32 too slows down the archiving stage and makes a slightly bigger archive.
It is possible to turn it off, getting a "straight" zpaq-style archive, with -nochecksum.
Since data reliability is more important to me, it is on by default.
https://encode.su/threads/3658-How-big-can-the-hash-slowdown-in-an-archiver-be-tolerable
So, by default, you get THREE different tests:
- SHA-1 of the fragments
- CRC-32 of the whole file
- XXHASH64 of the whole file (or a different one, using for example -blake3 or whatever)
PS -pakka changes only the output; it is an interface for the Windows GUI. Essentially it writes less information.
PS it is now 01:35; later I will fix and explain better. Time for... bed :)
You can find it here, as the very first difference:
https://github.com/fcorbelli/zpaqfranz/wiki/Diff-against-7.15:-add
BTW, I found that Apple Silicon (notably the M2 processor) seems to be hardware-optimized for SHA-256. When I ran the zpaqfranz benchmarks, even against a terabyte or two, SHA-256 performed in your benchmark at about the same speed as XXHASH3. It might make sense to check for hardware acceleration and use SHA-256 as a default instead of XXHASH3 when the performance is going to be roughly the same since SHA-256 is cryptographically strong while the various XXHASH algorithms don't have any cryptographic properties at all.
Since I don't know how the benchmarks are done, this may not actually be representative of real-world speeds. Still, it's at least worth thinking about, since a number of other platforms also now include some form of AES hardware to speed up AES cipher operations.
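As a crude cross-check of such numbers, hasher throughput can be sanity-checked with Python's `hashlib` (which typically delegates to OpenSSL, so SHA-NI or the ARMv8 crypto extensions get used where available). This is a rough sketch of mine, not zpaqfranz's `b` benchmark:

```python
import hashlib
import time

def throughput_mb_s(name: str, size: int = 32 * 1024 * 1024) -> float:
    """Rough single-thread throughput of a hashlib algorithm, in MB/s."""
    data = b"\xAB" * size
    h = hashlib.new(name)
    t0 = time.perf_counter()
    h.update(data)
    elapsed = time.perf_counter() - t0
    return size / elapsed / 1e6

for name in ("sha256", "sha1", "md5"):
    print(f"{name:8s} {throughput_mb_s(name):8.0f} MB/s")
```

On hardware with SHA extensions, sha256 can indeed land in the same ballpark as much "lighter" hashes; without them it usually trails badly.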
You can see "under the hood" with a
zpaqfranz b -debug
(...)
Free RAM seems 43.218.018.304
1838: new ecx 2130194955
1843: new ebx 563910569
SSSE3 :OK
SSE41 :OK
SHA :OK
SHA1/2 seems supported by CPU
You need 3 "OK" to "automagically" get HW acceleration.
Rare, very rare, with Intel CPUs.
Sadly I do not like Macs very much (almost always... I use the terminal just like a FreeBSD box :)
The benchmark is very, very rough, just a quick check to get some info on VPS CPUs.
I see your point, but the default is XXHASH (64 bit), not XXH3 (128 bit), so as not to "cut off" 32-bit CPUs (not all silicon is 64-bit).
With zpaqfranz you can choose... whatever you like (almost everywhere, with some exceptions for md5).
PS this is a "real world" example of an Intel-based server, with proxmox+FreeBSD VM, running on HDD:
root@franco:/home/mog1/copie # zpaqfranz versum "./*.zpaq" -checktxt
zpaqfranz v58.4q-JIT-L,HW SHA1/2,(2023-06-22)
franz:versum | - command
franz:-checktxt
66265: Test MD5 hashes of .zpaq against _md5.txt
66136: Searching for jolly archive(s) in <<./*.zpaq>> for extension <<zpaq>>
66288: Bytes to be checked 250.899.885.678 (233.67 GB) in files 2
009% 00:29:23 ( 22.19 GB) of ( 233.67 GB) 128.799.140/SeC
As you can see, the "real" bandwidth of the drive is about 128 MB/s; even a 10 GB/s hasher will gain nothing.
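For what it's worth, the progress line above is internally consistent, assuming binary gigabytes and that the 00:29:23 field is the remaining-time estimate (my reading of the output, not documented behavior):

```python
total = 250_899_885_678   # bytes to be checked (233.67 GB, binary)
done = 22.19 * 2**30      # 22.19 GB already hashed
rate = 128_799_140        # reported bytes/second

# remaining bytes divided by the measured rate gives the ETA
eta = round((total - done) / rate)
m, s = divmod(eta, 60)
print(f"{m:02d}:{s:02d}")  # → 29:23, matching the progress line
```

So at disk-limited speeds the choice of hasher is essentially free.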
Hello,
I have a question about the "t" command. Is there some bug, or should I worry about my data?
My use case: on Windows Server 2019 I have a DB2 database. I do a dump daily and store it in a zpaq file using just the "a" command without any switches. On that machine I use version "zpaqfranz v57.4h-JIT-L (HW BLAKE3,SHA1),SFX64 v55.1, (12 Mar 2023)".
When I test on that server, all is OK:
PS C:\instalki\zpaq715> .\zpaqfranz.exe t C:\KOPIE\backup.zpaq
zpaqfranz v57.4h-JIT-L (HW BLAKE3,SHA1),SFX64 v55.1, (12 Mar 2023)
C:/KOPIE/backup.zpaq:
15 versions, 15 files, 608.189 frags, 2.948 blks, 5.733.362.462 bytes (5.34 GB)
To be checked 442.288.750.592 in 15 files (24 threads)
7.15 stage time 353.92 no error detected (RAM ~385.55 MB), try CRC-32 (if any)
Checking 512.717 blocks with CRC-32 (429.318.293.888 not-0 bytes)
CRC-32 time 49.52s
Blocks 429.318.293.888 ( 512.717)
Zeros 12.970.456.704 ( 22.497) 6.616000 s
Total 442.288.750.592 speed 8.929.173.492/sec (8.32 GB/s)
GOOD : 00000015 of 00000015 (stored=decompressed)
VERDICT : OK (CRC-32 stored vs decompressed)
403.656 seconds (000:06:43) (all OK)
Then I transfer the archive to another computer (I use FileZilla's resume option to download only new data).
The other computer is Windows 10 with zpaqfranz version "zpaqfranz v58.4s-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-06-23)".
And the test results are:
PS K:\dir> zpaqfranz.exe t .\backup.zpaq
zpaqfranz v58.4s-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-06-23)
./backup.zpaq:
15 versions, 15 files, 5.733.362.462 bytes (5.34 GB)
To be checked 442.288.750.592 in 15 files (4 threads)
7.15 stage time 202.50 no error detected (RAM ~64.26 MB), try CRC-32 (if any)
Checking 512.717 blocks with CRC-32 (429.318.293.888 not-0 bytes)
ERROR: STORED CRC-32 77532235 != DECOMPRESSED C9E7F911 (ck 00019254) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230324223027.001
ERROR: STORED CRC-32 E3A7A2B1 != DECOMPRESSED B976C123 (ck 00021111) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230331223032.001
ERROR: STORED CRC-32 9AEAF621 != DECOMPRESSED 9849CEC6 (ck 00022809) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230407223028.001
ERROR: STORED CRC-32 55371556 != DECOMPRESSED DFE0B566 (ck 00024730) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230416223027.001
ERROR: STORED CRC-32 A09E6044 != DECOMPRESSED BB230EE2 (ck 00036567) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230506223028.001
ERROR: STORED CRC-32 A41EC1BA != DECOMPRESSED 93C71B3A (ck 00039680) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230516223032.001
ERROR: STORED CRC-32 49EA704B != DECOMPRESSED 33E65072 (ck 00040869) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230521223034.001
ERROR: STORED CRC-32 DDA93190 != DECOMPRESSED 3DD784C3 (ck 00041314) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230526223034.001
ERROR: STORED CRC-32 0B35FBDC != DECOMPRESSED 2DD981BB (ck 00040476) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230604223041.001
ERROR: STORED CRC-32 CB5A36E3 != DECOMPRESSED 2480F3C8 (ck 00042183) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230609224237.001
ERROR: STORED CRC-32 0082630B != DECOMPRESSED C2396AE7 (ck 00042683) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230615223050.001
ERROR: STORED CRC-32 E80A5C1D != DECOMPRESSED 2F7E5B91 (ck 00042703) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230701223027.001
ERROR: STORED CRC-32 AC3369F0 != DECOMPRESSED 47B5DD58 (ck 00043742) C:/KOPIE/DATABASE.0.DB2.DBPART000.20230709223032.001
ERROR: STORED CRC-32 3DFB85F8 != DECOMPRESSED D0F06B3B (ck 00010772) C:/kopie/DATABASE.0.DB2.DBPART000.20230320223031.001
ERROR: STORED CRC-32 2984D169 != DECOMPRESSED 8A141274 (ck 00043824) Z:/kopie/DATABASE.0.DB2.DBPART000.20230616233029.001
CRC-32 time 104.83s
Blocks 429.318.293.888 ( 512.717)
Zeros negative ( 72.961) 72.663000 s
Total 150.007.756.078 speed 1.430.975.742/sec (1.33 GB/s)
ERRORS : 00000015 (ERROR in rebuilded CRC-32, SHA-1 collisions?)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
WITH ERRORS
307.406 seconds (000:05:07) (with warnings)
But when I extract one file, for example "C:/KOPIE/DATABASE.0.DB2.DBPART000.20230709223032.001", and compute the CRC-32 hash manually with zpaqfranz, I get the good stored checksum:
PS C:\Users\user> zpaqfranz.exe sum D:\test\DATABASE.0.DB2.DBPART000.20230709223032.001 -crc32
zpaqfranz v58.4s-JIT-GUI-L,HW BLAKE3,SHA1/2,SFX64 v55.1,(2023-06-23)
franz:sum 1 - command
franz:-crc32
Getting CRC-32 ignoring .zfs and :$DATA
No multithread: Found (28.98 GB) => 31.112.585.216 bytes (28.98 GB) / 1 files in 0.015000
|CRC-32: AC3369F0 [ 31.112.585.216] |D:/test/DATABASE.0.DB2.DBPART000.20230709223032.001
214.296 seconds (000:03:34) (all OK)
The extracted hash AC3369F0 is equal to the stored hash from the "t" command.
I also calculated the SHA-256 checksum of the extracted dump file and of the original file on the server, and they are the same. So can I believe that the file stored in the zpaq archive is good?
PS1. While writing this comment I also downloaded the zpaqfranz exe from the server, and its test is good:
PS C:\Users\user\Desktop> .\zpaqfranz.exe t K:\dir\backup.zpaq
zpaqfranz v57.4h-JIT-L (HW BLAKE3,SHA1),SFX64 v55.1, (12 Mar 2023)
K:/dir/backup.zpaq:
15 versions, 15 files, 608.189 frags, 2.948 blks, 5.733.362.462 bytes (5.34 GB)
To be checked 442.288.750.592 in 15 files (4 threads)
7.15 stage time 199.73 no error detected (RAM ~64.26 MB), try CRC-32 (if any)
Checking 512.717 blocks with CRC-32 (429.318.293.888 not-0 bytes)
CRC-32 time 32.25s
Blocks 429.318.293.888 ( 512.717)
Zeros 12.970.456.704 ( 22.497) 4.292000 s
Total 442.288.750.592 speed 13.707.154.386/sec (12.77 GB/s)
GOOD : 00000015 of 00000015 (stored=decompressed)
VERDICT : OK (CRC-32 stored vs decompressed)
232.032 seconds (000:03:52) (all OK)
So maybe there is some bug in the "t" command in newer versions? Or an incompatibility in the archive format?
PS2. Also, thank You for the fantastic job of continuing to develop zpaq. I used the original zpaq for years and it was a nice find that someone continues the work ^_^
It is a known bug, for file sizes (in decimal) longer than 10 characters.
You can get the latest nightly build from http://www.francocorbelli.it/zpaqfranz with the bug fixed, plus the new -fasttxt magic computation of the full-archive CRC-32.
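For reference, the threshold is easy to check: any file of 10 GB or more has a decimal size longer than 10 digits, and the DB2 dump from the sum output above is well past it (illustration of the trigger only; the exact failure mode is internal to zpaqfranz):

```python
# size of DATABASE...20230709223032.001, taken from the sum output above
size = 31_112_585_216
assert len(str(size)) == 11           # 11 decimal digits > 10: hits the bug

# anything whose size fits in 10 digits (< 10.000.000.000 bytes) is unaffected
assert len(str(9_999_999_999)) == 10
```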
Thank You for quick response and... fixing release already ^_^