ebiggers / libdeflate

Heavily optimized library for DEFLATE/zlib/gzip compression and decompression

Libdeflate-compressed stream fails to decompress in kunzip

tansy opened this issue

I was testing your program kunzip and it turned out that a stream compressed or recompressed with libdeflate cannot be decompressed. I put samples in the cloud; you can find the recompressor here, and the sample file is in the Silesia corpus. You can also use this program: littlezip. A zip produced with libdeflate v1.19 will also fail to decompress. I also noticed that some older libdeflate versions don't have this symptom.

$ kunzip dickens.zip-ld-12
(...)
unzipping ./dickens
Checksums don't match: 1853117935 -1355125898

$ wc -c dickens
11743 dickens

I'm not certain what causes it, but I'm letting you know; maybe you can figure it out.
The problem may be in their software or in libdeflate. I cannot say categorically. They claim to follow the specification. I'm letting you know in case it is something in libdeflate. I hope this doesn't bother you too much.

Well, that program cannot unpack ZIP files created by zip either:

$ cd kunzip
$ make
$ zip -r examples.zip examples
$ ./kunzip examples.zip 

kunzip ZIP decompression routines
Copyright 2005-2015 - Michael Kohn <mike@mikekohn.net>
Version August 15, 2015

There are 8 files in this archive.
Unsupported compression used.
Checksums don't match: 0 -60478714

It looks like kunzip incorrectly tries to parse DEFLATE streams as zlib streams. The ZIP format never uses zlib streams. Please report this to the author of kunzip. Or just don't use kunzip.
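To illustrate the difference between the two formats, here is a minimal sketch using Python's standard `zlib` module (this is not kunzip's code, and the sample data is made up). A negative `wbits` value selects a raw DEFLATE stream, which is what a ZIP entry with the deflate method contains; the default expects a zlib header and an Adler-32 trailer around it.

```python
import zlib

# Made-up sample input, only for illustration.
data = b"It was the best of times, it was the worst of times. " * 50

# Raw DEFLATE (wbits=-15): no header and no trailer.
comp = zlib.compressobj(9, zlib.DEFLATED, -15)
raw_stream = comp.compress(data) + comp.flush()

# Correct for ZIP entries: decode as raw DEFLATE.
roundtrip = zlib.decompress(raw_stream, wbits=-15)

# Incorrect: decode the same bytes as a zlib stream (the default).
try:
    zlib.decompress(raw_stream)
    parsed_as_zlib = True
except zlib.error:
    # The first two bytes are not a valid zlib header.
    parsed_as_zlib = False

print(roundtrip == data, parsed_as_zlib)
```

A decompressor that looks for the zlib header in a ZIP entry would reject, or misread, the very first bytes of the stream.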

> Well, that program cannot unpack ZIP files created by zip either:

It can. `build/Makefile` doesn't work well; somehow `make` produces the wrong output.
Use this (`-DZIP` seems to be important):
$ gcc -O2 -DZIP -o kunzip src/*.c

I have a question. @mikeakohn said that:

> It's getting tripped up loading the Dynamic Huffman tables in the second block of compressed code.
> (...)
> The code at value 273 is 16.. meaning repeat the previous code X number of times where the value of X is the next two bits. In this case it's 3... which overflows 274 by 2 values.
> unzip is fine with this because it reads in both the literal and distance codes in 1 shot in a single array. kunzip reads them separately. The spec says nothing about this.

The reason I'm asking is that I wanted to use libdeflate to recompress the DEFLATE streams in zips used in e-publications. It seems to be a better option than zopfli, which is very slow, but now, after this issue, I realised that it may not be compatible with decompressors used in readers. I have to make sure they will all be able to decompress it. In my tests zlib, which is well supported, doesn't produce such tables. And that's my concern: if kunzip had a problem with it, despite following the specification, will some, probably old, embedded decompressors cope with such input? I cannot put it in production if readers would fail on a libdeflate-compressed stream anywhere.

cc @NeRdTheNed interesting compliance question?...

> cc @NeRdTheNed interesting compliance question?...

This is indeed spec legal! Section 3.2.7 of RFC 1951 states "The code length repeat codes can cross from HLIT + 257 to the HDIST + 1 code lengths. In other words, all code lengths form a single sequence of HLIT + HDIST + 258 values". I'm unsure how many decompressors would have trouble with this, but it is well defined. I think zlib may not emit repeat codes which cross from HLIT to HDIST, since it RLE-compressed them separately when I last checked, so it's possible that some decompressors don't test this case very well.
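As a rough illustration of what "a single sequence" means for a decoder, here is a minimal Python sketch. It is not taken from libdeflate, zlib, or kunzip. The function name and the input shape (a list of already Huffman-decoded `(symbol, extra)` pairs) are assumptions made for the example. All lengths are expanded into one array, which is split into literal/length and distance tables only at the end.

```python
def expand_code_lengths(symbols, num_litlen, num_dist):
    """Expand code-length symbols (RFC 1951, section 3.2.7) into two tables.

    `symbols` is a list of (symbol, extra) pairs, where `extra` is the value
    of the extra bits that follow symbols 16, 17 and 18 (hypothetical input
    shape, chosen to keep the bit reader out of the sketch).
    """
    total = num_litlen + num_dist
    lens = []
    for sym, extra in symbols:
        if 0 <= sym <= 15:
            run, value = 1, sym                # literal code length
        elif sym == 16:
            if not lens:
                raise ValueError("repeat code with no previous length")
            run, value = 3 + extra, lens[-1]   # repeat previous 3..6 times
        elif sym == 17:
            run, value = 3 + extra, 0          # 3..10 zeros
        elif sym == 18:
            run, value = 11 + extra, 0         # 11..138 zeros
        else:
            raise ValueError("invalid code-length symbol")
        lens.extend([value] * run)
        # A run may cross num_litlen, but not the end of the combined array.
        if len(lens) > total:
            raise ValueError("run overflows the combined table")
    if len(lens) != total:
        raise ValueError("too few code lengths")
    return lens[:num_litlen], lens[num_litlen:]


# The case from this thread: 274 literal/length codes, and a repeat code at
# position 273 that repeats 3 times, so it spills 2 entries into the
# distance lengths. The concrete length values here are made up.
symbols = [(8, 0)] * 272 + [(9, 0), (16, 0), (5, 0), (5, 0)]
litlen, dist = expand_code_lengths(symbols, 274, 4)
print(len(litlen), litlen[-1], dist)
```

A decoder that fills the literal/length table and the distance table in two separate loops has to carry the remainder of such a run over into the second loop, which is presumably where kunzip tripped.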

> The spec says nothing about this.

That is incorrect. The DEFLATE RFC clearly says that this is allowed. Refer to section 3.2.7, Compression with dynamic Huffman codes:

"The code length repeat codes can cross from HLIT + 257 to the HDIST + 1 code lengths. In other words, all code lengths form a single sequence of HLIT + HDIST + 258 values."

> I realised that it may not be compatible with decompressors used in readers.

Do you have any real world examples? The project you're pointing to doesn't look like code that's really used anywhere. Generally speaking, if you e.g. do a github search for ZIP software and pick some random buggy code that someone uploaded years ago, I don't think it's reasonable to expect everyone to work around bugs in that code. If there's evidence that this issue is more widespread and that other compressors avoid this case too, that could be different though. (That's what happened for the case of Huffman codes containing fewer than 2 codewords.)

BTW, it looks like zopfli has the same behavior as libdeflate here

> Do you have any real world examples?

Simplified, with instruction inside.
If you need gzip I will make one.

> if you e.g. do a github search for ZIP software and pick some random buggy code that someone uploaded years ago, I don't think it's reasonable to expect everyone to work around bugs in that code

I'm not asking you that.
The author has already corrected it, but without testing it I would not have been aware of such a thing. And out of all other compressors (zip, 7zip, zlib@pigz) only libdeflate presented this issue.
That made me realise it may be a problem, and that's why I'm asking.

Also noticed that older libdeflate didn't present this behaviour, or at least kunzip didn't trip on it.

> Simplified, with instruction inside.

That's just a file that encounters this case, not a real world example of it being a problem.

Anyone can write buggy code and post it on the internet. What matters is where it is used.

> And out of all other compressors (zip, 7zip, zlib@pigz) only libdeflate presented this issue.

Did you check zopfli? Hint: it does what libdeflate does.

> Also noticed that older libdeflate didn't present this behaviour

Not true; it's always been like this.

zopfli hits this case 51 times on enwik8:

With num_litlen_syms=285, lens[282..287]=10
With num_litlen_syms=286, lens[285..287]=13
With num_litlen_syms=283, lens[279..284]=12
With num_litlen_syms=276, lens[272..277]=11
With num_litlen_syms=278, lens[277..281]=11
With num_litlen_syms=280, lens[276..281]=11
With num_litlen_syms=279, lens[276..281]=12
With num_litlen_syms=286, lens[285..287]=13
With num_litlen_syms=275, lens[272..276]=12
With num_litlen_syms=280, lens[275..280]=9
With num_litlen_syms=285, lens[282..286]=12
With num_litlen_syms=282, lens[277..282]=12
With num_litlen_syms=278, lens[277..281]=12
With num_litlen_syms=276, lens[272..277]=8
With num_litlen_syms=275, lens[271..276]=12
With num_litlen_syms=280, lens[277..281]=13
With num_litlen_syms=276, lens[272..277]=7
With num_litlen_syms=279, lens[275..280]=11
With num_litlen_syms=270, lens[269..273]=8
With num_litlen_syms=276, lens[273..278]=11
With num_litlen_syms=278, lens[275..279]=13
With num_litlen_syms=278, lens[274..279]=10
With num_litlen_syms=276, lens[273..278]=7
With num_litlen_syms=279, lens[277..279]=8
With num_litlen_syms=276, lens[273..278]=10
With num_litlen_syms=286, lens[282..287]=8
With num_litlen_syms=278, lens[275..280]=11
With num_litlen_syms=279, lens[274..279]=11
With num_litlen_syms=284, lens[281..286]=11
With num_litlen_syms=281, lens[276..281]=10
With num_litlen_syms=276, lens[275..279]=11
With num_litlen_syms=275, lens[272..277]=8
With num_litlen_syms=271, lens[268..273]=8
With num_litlen_syms=276, lens[275..280]=9
With num_litlen_syms=284, lens[281..286]=8
With num_litlen_syms=271, lens[268..273]=8
With num_litlen_syms=269, lens[268..272]=8
With num_litlen_syms=281, lens[280..282]=11
With num_litlen_syms=277, lens[273..278]=11
With num_litlen_syms=286, lens[285..290]=9
With num_litlen_syms=276, lens[271..276]=11
With num_litlen_syms=277, lens[274..279]=12
With num_litlen_syms=278, lens[277..281]=9
With num_litlen_syms=285, lens[284..288]=11
With num_litlen_syms=282, lens[279..283]=11
With num_litlen_syms=277, lens[274..278]=11
With num_litlen_syms=279, lens[276..281]=10
With num_litlen_syms=265, lens[261..266]=6
With num_litlen_syms=278, lens[273..278]=9
With num_litlen_syms=280, lens[276..281]=11
With num_litlen_syms=277, lens[272..277]=9

libdeflate hits it 43 times (compressing whole file at once, at level 12):

With num_litlen_syms=274, lens[273..275]=9
With num_litlen_syms=276, lens[275..277]=9
With num_litlen_syms=276, lens[273..276]=10
With num_litlen_syms=285, lens[284..286]=11
With num_litlen_syms=276, lens[273..277]=11
With num_litlen_syms=276, lens[275..277]=9
With num_litlen_syms=281, lens[280..282]=13
With num_litlen_syms=276, lens[275..277]=10
With num_litlen_syms=274, lens[271..274]=9
With num_litlen_syms=279, lens[277..279]=10
With num_litlen_syms=276, lens[275..277]=11
With num_litlen_syms=277, lens[275..277]=10
With num_litlen_syms=273, lens[272..274]=10
With num_litlen_syms=278, lens[277..281]=11
With num_litlen_syms=278, lens[277..279]=9
With num_litlen_syms=280, lens[277..281]=11
With num_litlen_syms=277, lens[275..277]=9
With num_litlen_syms=277, lens[275..277]=11
With num_litlen_syms=276, lens[275..278]=8
With num_litlen_syms=280, lens[279..281]=11
With num_litlen_syms=279, lens[278..282]=12
With num_litlen_syms=280, lens[278..280]=12
With num_litlen_syms=277, lens[275..277]=11
With num_litlen_syms=278, lens[277..279]=11
With num_litlen_syms=284, lens[283..285]=10
With num_litlen_syms=278, lens[277..279]=13
With num_litlen_syms=278, lens[276..278]=11
With num_litlen_syms=276, lens[275..277]=9
With num_litlen_syms=276, lens[274..276]=11
With num_litlen_syms=279, lens[278..280]=11
With num_litlen_syms=277, lens[275..277]=9
With num_litlen_syms=277, lens[275..278]=9
With num_litlen_syms=277, lens[275..277]=10
With num_litlen_syms=276, lens[271..276]=9
With num_litlen_syms=280, lens[277..281]=10
With num_litlen_syms=276, lens[274..276]=10
With num_litlen_syms=275, lens[273..275]=9
With num_litlen_syms=276, lens[274..276]=11
With num_litlen_syms=278, lens[277..279]=11
With num_litlen_syms=276, lens[273..276]=10
With num_litlen_syms=276, lens[274..276]=9
With num_litlen_syms=276, lens[273..277]=10
With num_litlen_syms=280, lens[278..280]=10

So the results are quite similar.
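For anyone reading these logs, the following Python sketch shows one way to interpret a line. It assumes that `lens[a..b]` gives inclusive indices into the combined code-length array, and that index `num_litlen_syms` is therefore the first distance code length. That reading is an assumption based on the numbers above, not something stated by the tool that printed them. The helper name is hypothetical.

```python
import re

# Matches lines such as "With num_litlen_syms=274, lens[273..275]=9".
LOG_LINE = re.compile(r"num_litlen_syms=(\d+), lens\[(\d+)\.\.(\d+)\]=(\d+)")


def crosses_boundary(line):
    """Return True if the logged run starts in the literal/length lengths
    and ends in the distance lengths (inclusive indices assumed)."""
    match = LOG_LINE.search(line)
    if match is None:
        raise ValueError("unrecognised log line: " + line)
    num_litlen, first, last, _length = map(int, match.groups())
    return first < num_litlen <= last


sample = [
    "With num_litlen_syms=285, lens[282..287]=10",
    "With num_litlen_syms=274, lens[273..275]=9",
    "With num_litlen_syms=279, lens[277..279]=8",
]
for line in sample:
    print(line, "->", crosses_boundary(line))
```

Under that reading, every line in both lists describes a run that begins before the HLIT boundary and ends at or after it.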

OK, I tested zopfli and, indeed, it shows the same behaviour, though later: in this case at ~900 kB.
I'll test it further on real data, as you suggested.

To be clear, we already know that this case can be encountered on various files. The question is whether any important software actually cares about it.