ebiggers / libdeflate

Heavily optimized library for DEFLATE/zlib/gzip compression and decompression

Reject all DEFLATE streams that zlib rejects

Dongmuliang opened this issue · comments

commented

Hi, I recently fuzzed libdeflate's parsing of zlib-format files and found some interesting cases.
Specifically, libdeflate accepts these files without any issue, while another parser, zlib, rejects them. I have also contacted the zlib authors.

To check whether a file is accepted, I use the following code (mainly from zlib_decompress/fuzz.c):

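#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>
#include <libdeflate.h>

int main(int argc, char **argv)
{
	struct libdeflate_decompressor *d;
	struct stat stbuf;
	int ret;
	int fd;

	assert(argc == 2);
	fd = open(argv[1], O_RDONLY);
	assert(fd >= 0);
	ret = fstat(fd, &stbuf);
	assert(!ret);
	assert(stbuf.st_size > 0);	/* a zero-length VLA is undefined behavior */

	/* Read the whole file onto the stack; fine for small PoC files. */
	char in[stbuf.st_size];
	ssize_t nread = read(fd, in, sizeof in);
	assert(nread == (ssize_t)sizeof in);
	close(fd);

	/* Assume the data expands by at most 30x. */
	char out[sizeof(in) * 30];

	d = libdeflate_alloc_decompressor();
	assert(d != NULL);
	size_t out_size = 0;

	/* 0 is LIBDEFLATE_SUCCESS; any other value is an error code. */
	enum libdeflate_result res = libdeflate_zlib_decompress(d, in, sizeof in, out, sizeof out, &out_size);
	printf("decode res:%d\n", res);
	libdeflate_free_decompressor(d);
	return 0;
}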

These interesting files are attached!
pocs.zip

There are several edge cases where for performance reasons, libdeflate is intentionally more accepting than zlib, in a safe way. The specific case that your example triggers is the case where the encoded codeword lengths expand to more than the number of codewords. But there are a few others too.
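To make the "codeword lengths expand to more than the number of codewords" case concrete, here is a minimal sketch of a strict check, assuming the run-length scheme from RFC 1951 (symbols 0-15 are literal lengths, 16 repeats the previous length 3-6 times, 17 and 18 repeat zero 3-10 and 11-138 times). The `presym_item` struct and `expand_lens_strict` function are hypothetical names for illustration; this is not libdeflate's or zlib's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical input: one decoded precode symbol (0..18) plus the value
 * of the extra bits that followed it in the stream. */
struct presym_item {
	uint8_t sym;
	uint8_t extra;
};

/*
 * Expand run-length-coded codeword lengths into lens[0..num_lens-1].
 * Returns 0 on success, or -1 if a repeat run would write past num_lens
 * (the strict behavior discussed in this issue), if a "repeat previous"
 * symbol comes first, or if the items run out before lens[] is full.
 */
static int expand_lens_strict(const struct presym_item *items, size_t num_items,
			      uint8_t *lens, size_t num_lens)
{
	size_t i = 0;

	for (size_t k = 0; k < num_items && i < num_lens; k++) {
		uint8_t sym = items[k].sym;
		uint8_t val;
		size_t rep;

		if (sym < 16) {			/* explicit length */
			lens[i++] = sym;
			continue;
		}
		if (sym == 16) {		/* repeat previous length 3..6 times */
			if (i == 0)
				return -1;
			val = lens[i - 1];
			rep = 3 + (items[k].extra & 0x3);
		} else if (sym == 17) {		/* repeat zero 3..10 times */
			val = 0;
			rep = 3 + (items[k].extra & 0x7);
		} else if (sym == 18) {		/* repeat zero 11..138 times */
			val = 0;
			rep = 11 + (items[k].extra & 0x7f);
		} else {
			return -1;		/* not a precode symbol */
		}
		if (rep > num_lens - i)		/* run overruns the codeword count */
			return -1;
		while (rep--)
			lens[i++] = val;
	}
	return i == num_lens ? 0 : -1;
}
```

A lenient decoder could instead truncate the final run at `num_lens`, which is still memory-safe as long as the write is bounded; the strict variant above simply turns the overrun into an error.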

There isn't any real problem with doing this, since in general corruption in a DEFLATE stream can only be detected by a checksum anyway.

Can you elaborate on why you consider this to be a problem?

commented

Hi @ebiggers, thanks for your explanation. Generally, users should be notified promptly of any corruption in compressed data, because it can have severe effects and make recovery difficult. This is different from uncompressed text, which is probably still useful despite a few corrupted bytes.
Therefore, silently accepting corrupted data is not a good choice.

This seems unlikely to be a real problem, because there is only a very low probability that corrupted data also matches its checksum. However, considering libdeflate's wide usage, including in some critical systems, the situation changes when a stealthy attacker is involved (e.g., an attacker may combine other bugs to hijack the checksum function, which can then be used to maliciously correct the checksum).

> Hi @ebiggers, thanks for your explanation. Generally, users should be notified promptly of any corruption in compressed data, because it can have severe effects and make recovery difficult.

Yes, which is why people who want to detect data corruption need to use a checksum (e.g. as the gzip and zlib wrapper formats for DEFLATE do), and not rely on the incidental built-in redundancies of the DEFLATE format which are much, much less effective at detecting data corruption. Corrupting a DEFLATE stream will very often create another valid DEFLATE stream. In contrast, just a 32-bit checksum will detect 99.99999997% of corruptions.
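As an illustration of the checksum point, the zlib wrapper format (RFC 1950) ends with an Adler-32 of the uncompressed data, and gzip uses a CRC-32. Below is a minimal, unoptimized sketch of Adler-32; the function name `adler32_simple` is an assumption made for this example, and real libraries use much faster implementations.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Straightforward Adler-32 as defined in RFC 1950: two 16-bit sums
 * modulo 65521, combined as (b << 16) | a. Reducing after every byte
 * is slow but keeps the sketch obviously overflow-free.
 */
static uint32_t adler32_simple(const void *data, size_t len)
{
	const unsigned char *buf = data;
	uint32_t a = 1, b = 0;

	for (size_t i = 0; i < len; i++) {
		a = (a + buf[i]) % 65521;
		b = (b + a) % 65521;
	}
	return (b << 16) | a;
}
```

For example, `adler32_simple("Wikipedia", 9)` yields `0x11E60398`, while changing a single byte (e.g. `"Wikipedja"`) yields a different value, so a decompressor that compares the stored and computed checksums would report the corruption regardless of whether the DEFLATE stream itself still parses.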

The question of when the DEFLATE decompressor should report an error when it's given an invalid stream, vs. remap it to a valid stream, is really just a minor quality-of-implementation question.

I'd argue that reporting DEFLATE decompression errors is actually sort of bad, because it misleads people into thinking that DEFLATE has built-in error detection, which it doesn't. You need a checksum if you want to detect data corruption.

That being said, I do understand that zlib is the standard implementation of DEFLATE, and it's generally better to be consistent with it... if only so that people running fuzzers that compare the implementations aren't confused.

It's easy to make libdeflate return an error when "the encoded codeword lengths expand to more than the number of codewords", so I'll do that. That handles two of your examples.

However, that still leaves the fact that DEFLATE streams can contain invalid litlen and offset symbols. Those are really hard to handle efficiently, other than by remapping them to valid symbols (as libdeflate does). I am probably not going to change what libdeflate does for those, as it does not seem worth it...
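For reference, RFC 1951 states that literal/length symbols 286-287 and distance (offset) symbols 30-31 take part in code construction but never occur in valid compressed data. A strict decoder would need a check along the following lines on every decoded symbol, which is the per-symbol cost mentioned above. This is only a sketch with hypothetical helper names, not how zlib or libdeflate implements it.

```c
#include <stdbool.h>

/* Symbol counts from RFC 1951: litlen symbols 0..285 and offset
 * symbols 0..29 are the only ones that may appear in a valid stream. */
#define NUM_VALID_LITLEN_SYMS 286
#define NUM_VALID_OFFSET_SYMS 30

/* Hypothetical strict checks; a decoder would call these after
 * decoding each symbol and fail the stream if either returns false. */
static bool litlen_sym_is_valid(unsigned sym)
{
	return sym < NUM_VALID_LITLEN_SYMS;
}

static bool offset_sym_is_valid(unsigned sym)
{
	return sym < NUM_VALID_OFFSET_SYMS;
}
```

The comparison itself is cheap, but placing an extra branch in the innermost decode loop is the kind of overhead a performance-focused decoder tries to avoid, which is why remapping invalid symbols to valid ones is attractive as long as it stays memory-safe.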

> the situation changes when a stealthy attacker is involved (e.g., an attacker may combine other bugs to hijack the checksum function, which can then be used to maliciously correct the checksum).

Detecting malicious modifications is totally irrelevant here, as a cryptographic MAC would be needed for that.

65da376 handled poc1 and poc4.

2574818 would handle poc2 and poc3, but it would be a more complex change. I'm not sure it's worth merging, given that the existing behavior is safe and acceptable too. It also would only detect invalid offset symbols when they occur less than 4 GiB into the stream. I haven't found a way to detect them at positions greater than 4 GiB without adding overhead to the decompression inner loop. I don't want to slow down decompression for everyone just because of some artificial test.

commented

Thanks for your explanation and the bug fix!