OOM kill on very high compression ratio
DoNck opened this issue
Describe the bug
I need to process binary files from 7z archives.
Under certain circumstances, these files can contain big chunks of identical data.
When extracting said files, the process is killed (SIGKILL) after running out of memory (exit code 137).
To Reproduce
Please execute the following commands to generate a file containing 1 GB of random data followed by 2 GB of zeroes:
failing_file=failing_file.txt && head -c 1G /dev/urandom > $failing_file && head -c 2G /dev/zero >> $failing_file
Then, create an archive (any other method can be used):
7z a failing_archive.7z $failing_file
Finally, write the following test_crash.py Python script:
from py7zr import SevenZipFile
with SevenZipFile("failing_archive.7z", "r") as archive_handle:
archive_handle.extract(targets=["failing_file.txt"])
# end with
The extraction of failing_file.txt from failing_archive.7z crashes in this case:
python3 test_crash.py
echo $?
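For reference, exit status 137 is how the shell reports a process killed by SIGKILL (128 + signal number 9), which is what the kernel's OOM killer sends. A quick standalone sketch (POSIX only, not part of the reproduction) confirming the arithmetic:

```python
import signal
import subprocess
import sys

# Spawn a child that sends itself SIGKILL, mimicking an OOM kill.
proc = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)
# subprocess reports death-by-signal as a negative return code;
# the shell reports the same event as 128 + 9 = 137.
print(proc.returncode)          # -9
print(128 + signal.SIGKILL)     # 137
```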
Expected behavior
failing_file.txt should be extracted
Environment (please complete the following information):
- OS: Ubuntu 18, 64-bit, 4 GB of RAM
- Python 3.8.5
- py7zr version: 0.18.3, installed via pip
Test data (please attach in the report):
Please refer to the 'To Reproduce' section
Additional context
I traced the issue back to two particular instructions in compressor.py:
- The _decompress method call (line 678) generates a huge chunk (decompressor.decompress, line 634) when hitting identical data (2 GB of zeroes in this case).
- If it didn't fail earlier, line 684 can create an array bigger than the available memory.
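The effect is easy to demonstrate with the standard library's lzma module alone (a standalone sketch, not py7zr's code): without max_length, a single decompress() call materialises the entire output in one allocation, however large; with max_length it returns a bounded chunk and buffers the rest.

```python
import lzma

# 8 MiB of zeros: a highly compressible payload, a miniature of the 2 GB case.
payload = bytes(8 * 1024 * 1024)
compressed = lzma.compress(payload)

# Unbounded call: the whole 8 MiB comes back at once.
dec = lzma.LZMADecompressor()
out_full = dec.decompress(compressed)
print(len(out_full))  # 8388608

# Bounded call: max_length caps the chunk size; decompression is not finished.
dec = lzma.LZMADecompressor()
out_bounded = dec.decompress(compressed, max_length=65536)
print(len(out_bounded) <= 65536, dec.eof)  # True False
```

Scaled up to 2 GB of zeroes, the unbounded call is exactly the huge allocation described above.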
Is this a report of a zip bomb?
We may want to limit the output data size, as is already done for LZMA1 in
Lines 630 to 635 in ed8243c
Thanks. It seems to be a design bug.
Is this a report of a zip bomb?
I didn't create an actual zip bomb on purpose. It just happens that some files I process contain a lot of zeroes.
From my point of view, py7zr seems to allocate twice the amount of memory the algorithm actually requires for decompression.
Evidence that this is an implementation concern rather than an algorithmic requirement: the official 7z binary manages to extract the file in the same environment.
Regarding the output data size limitation, my understanding is currently too limited to comment on your proposition.
Is this change feasible for the issue? @DoNck
diff --git a/py7zr/compressor.py b/py7zr/compressor.py
index 4cc230a..4dc8e4c 100644
--- a/py7zr/compressor.py
+++ b/py7zr/compressor.py
@@ -628,10 +628,7 @@ class SevenZipDecompressor:
def _decompress(self, data, max_length: int):
for i, decompressor in enumerate(self.chain):
if self._unpacked[i] < self._unpacksizes[i]:
- if isinstance(decompressor, LZMA1Decompressor) or isinstance(decompressor, PpmdDecompressor):
- data = decompressor.decompress(data, max_length) # always give max_length for lzma1
- else:
- data = decompressor.decompress(data)
+ data = decompressor.decompress(data, max_length)
self._unpacked[i] += len(data)
elif len(data) == 0:
data = b""
It passes all the test cases in the project.
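The unified call in the patch assumes every decompressor in the chain accepts a (data, max_length) pair. The standard library's decompressor objects already share that signature, as this minimal illustration shows (stdlib classes only; py7zr's own wrapper classes are a separate matter):

```python
import bz2
import lzma

payload = bytes(1 << 20)  # 1 MiB of zeros

results = {}
for make, blob in [
    (lzma.LZMADecompressor, lzma.compress(payload)),
    (bz2.BZ2Decompressor, bz2.compress(payload)),
]:
    dec = make()
    # Uniform (data, max_length) call, as in the patched _decompress.
    chunk = dec.decompress(blob, max_length=4096)
    results[type(dec).__name__] = len(chunk)
print(results)  # each stage returned at most 4096 bytes
```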
The very high compression archive case now passes:
https://github.com/miurahr/py7zr/runs/6051942489?check_suite_focus=true
A simple case in #434 reproduces the extraction of a very-high-compression-ratio archive.
@pytest.mark.slow
def test_extract_high_compression_rate(tmp_path):
gen = Generator()
with py7zr.SevenZipFile(tmp_path.joinpath("target.7z"), "w") as source:
source.writef(gen, "source")
with limit_memory(limit):
with py7zr.SevenZipFile(tmp_path.joinpath("target.7z"), "r") as target:
target.extractall(path=tmp_path)
A Generator class implementing io.BufferedIOBase produces 1 GB of zeros, like /dev/zero.
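The test's Generator class is not shown in full here; a minimal sketch of such a zero-producing file-like object (built on io.RawIOBase wrapped in a BufferedReader; class and attribute names are hypothetical, not py7zr's) could look like:

```python
import io

class ZeroGenerator(io.RawIOBase):
    """File-like object yielding a fixed number of zero bytes, like /dev/zero."""

    def __init__(self, size):
        self.remaining = size

    def readable(self):
        return True

    def readinto(self, b):
        # Zero-fill the caller's buffer, up to the bytes still remaining.
        n = min(len(b), self.remaining)
        b[:n] = bytes(n)
        self.remaining -= n
        return n

gen = io.BufferedReader(ZeroGenerator(1 << 20))  # 1 MiB for a quick check
data = gen.read()
print(len(data))  # 1048576
```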
You can run the reproduction with pytest --no-cov --run-slow -k extract_high_compression_rate
#434 reduces memory usage, but actual memory usage still needs to be measured with a profiler tool.
Here is the same test case without the #434 fix.
@DoNck could you confirm an improvement?
Hi, thanks for addressing this issue, I'll check this ASAP.
@DoNck You may want to change
https://github.com/miurahr/py7zr/blob/master/tests/test_misc.py#L161-L168
to reproduce your case.
You can run it by pytest --no-cov --run-slow -k extract_high_compression_rate
or PYTEST_ADDOPTS="--run-slow -k high" tox -e py39
I've taken a memory profile with PYTEST_ADDOPTS="--run-slow -k high" tox -e mprof
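The mprof tox environment above is py7zr's own profiling setup. As a rough standalone alternative, the standard library's tracemalloc can confirm that bounded-chunk decompression keeps peak tracked memory well below the decompressed size (an illustrative sketch, not the project's profiling harness):

```python
import lzma
import tracemalloc

# 32 MiB of zeros, compressed to a few KiB -- a miniature of the reported case.
compressed = lzma.compress(bytes(32 * 1024 * 1024))

tracemalloc.start()
dec = lzma.LZMADecompressor()
# Decompress in bounded 1 MiB chunks instead of one giant allocation.
total = len(dec.decompress(compressed, max_length=1 << 20))
while not dec.eof:
    total += len(dec.decompress(b"", max_length=1 << 20))
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(total)                    # 33554432: the full output was produced
print(peak < 32 * 1024 * 1024)  # True: peak stayed below the full output size
```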