OOM kill on very high compression ratio
DoNck opened this issue
Describe the bug
I need to process binary files from 7z archives.
Under certain circumstances, these files can contain big chunks of identical data.
When extracting said files, the process is killed (SIGKILL) after running out of memory (exit code 137).
To Reproduce
Please execute the following commands to generate a file containing 1 GB of random data followed by 2 GB of zeroes:
failing_file=failing_file.txt && head -c 1G /dev/urandom > $failing_file && head -c 2G /dev/zero >> $failing_file
Then, create an archive (any other method can be used):
7z a failing_archive.7z $failing_file
Finally, write the following test_crash.py Python script:
from py7zr import SevenZipFile
with SevenZipFile("failing_archive.7z", "r") as archive_handle:
archive_handle.extract(targets=["failing_file.txt"])
# end with
The extraction of failing_file.txt from failing_archive.7z crashes in this case:
python3 test_crash.py
echo $?
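For reference, exit status 137 is how the shell reports a process killed by SIGKILL (128 + signal number 9), which is what the kernel's OOM killer sends. A quick standalone sketch (POSIX only, not part of the reproduction) confirming the arithmetic:

```python
import signal
import subprocess
import sys

# Spawn a child that sends itself SIGKILL, mimicking an OOM kill.
proc = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)
# subprocess reports death-by-signal as a negative return code;
# the shell reports the same event as 128 + 9 = 137.
print(proc.returncode)          # -9
print(128 + signal.SIGKILL)     # 137
```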
Expected behavior
failing_file.txt should be extracted
Environment (please complete the following information):
- OS: Ubuntu 18, 64-bit, 4 GB of RAM
- Python 3.8.5
- py7zr version: 0.18.3, installed via pip
Test data (please attach in the report):
Please refer to the 'To Reproduce' section
Additional context
I traced the issue back to two particular instructions in compressor.py:
- The _decompress method call (line 678) generates a huge chunk (decompressor.decompress, line 634) when hitting identical data (2 GB of zeroes in this case).
- If it didn't fail earlier, line 684 can create an array bigger than the available memory.
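The effect is easy to demonstrate with the standard library's lzma module alone (a standalone sketch, not py7zr's code): without max_length, a single decompress() call materialises the entire output in one allocation, however large; with max_length it returns a bounded chunk and buffers the rest.

```python
import lzma

# 8 MiB of zeros: a highly compressible payload, a miniature of the 2 GB case.
payload = bytes(8 * 1024 * 1024)
compressed = lzma.compress(payload)

# Unbounded call: the whole 8 MiB comes back at once.
dec = lzma.LZMADecompressor()
out_full = dec.decompress(compressed)
print(len(out_full))  # 8388608

# Bounded call: max_length caps the chunk size; decompression is not finished.
dec = lzma.LZMADecompressor()
out_bounded = dec.decompress(compressed, max_length=65536)
print(len(out_bounded) <= 65536, dec.eof)  # True False
```

Scaled up to 2 GB of zeroes, the unbounded call is exactly the huge allocation described above.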
Is this a report of a zip bomb?
We may want to limit the output data size, as is already done for LZMA1 in
Lines 630 to 635 in ed8243c
Thanks. It seems to be a design bug.
Is this a report of a zip bomb?
I didn't create an actual zip bomb on purpose. It just happens that some files I process contain a lot of zeroes.
From my point of view, py7zr seems to allocate twice the amount of memory the algorithm actually requires for decompression.
Evidence that this is an implementation concern rather than an algorithmic requirement: the official 7z binary manages to extract the file in the same environment.
Regarding the output data size limitation, my understanding is currently too limited to comment on your proposition.
Is this change feasible for the issue? @DoNck
diff --git a/py7zr/compressor.py b/py7zr/compressor.py
index 4cc230a..4dc8e4c 100644
--- a/py7zr/compressor.py
+++ b/py7zr/compressor.py
@@ -628,10 +628,7 @@ class SevenZipDecompressor:
def _decompress(self, data, max_length: int):
for i, decompressor in enumerate(self.chain):
if self._unpacked[i] < self._unpacksizes[i]:
- if isinstance(decompressor, LZMA1Decompressor) or isinstance(decompressor, PpmdDecompressor):
- data = decompressor.decompress(data, max_length) # always give max_length for lzma1
- else:
- data = decompressor.decompress(data)
+ data = decompressor.decompress(data, max_length)
self._unpacked[i] += len(data)
elif len(data) == 0:
data = b""
It passes all the test cases in the project.
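The unified call in the patch assumes every decompressor in the chain accepts a (data, max_length) pair. The standard library's decompressor objects already share that signature, as this minimal illustration shows (stdlib classes only; py7zr's own wrapper classes are a separate matter):

```python
import bz2
import lzma

payload = bytes(1 << 20)  # 1 MiB of zeros

results = {}
for make, blob in [
    (lzma.LZMADecompressor, lzma.compress(payload)),
    (bz2.BZ2Decompressor, bz2.compress(payload)),
]:
    dec = make()
    # Uniform (data, max_length) call, as in the patched _decompress.
    chunk = dec.decompress(blob, max_length=4096)
    results[type(dec).__name__] = len(chunk)
print(results)  # each stage returned at most 4096 bytes
```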
The very high compression archive case now passes:
https://github.com/miurahr/py7zr/runs/6051942489?check_suite_focus=true
A simple case in #434 reproduces the extraction of a very-high-compression-ratio archive.
@pytest.mark.slow
def test_extract_high_compression_rate(tmp_path):
gen = Generator()
with py7zr.SevenZipFile(tmp_path.joinpath("target.7z"), "w") as source:
source.writef(gen, "source")
with limit_memory(limit):
with py7zr.SevenZipFile(tmp_path.joinpath("target.7z"), "r") as target:
target.extractall(path=tmp_path)
A Generator class implementing io.BufferedIOBase produces 1 GB of zeros, like /dev/zero.
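The test's Generator class is not shown in full here; a minimal sketch of such a zero-producing file-like object (built on io.RawIOBase wrapped in a BufferedReader; class and attribute names are hypothetical, not py7zr's) could look like:

```python
import io

class ZeroGenerator(io.RawIOBase):
    """File-like object yielding a fixed number of zero bytes, like /dev/zero."""

    def __init__(self, size):
        self.remaining = size

    def readable(self):
        return True

    def readinto(self, b):
        # Zero-fill the caller's buffer, up to the bytes still remaining.
        n = min(len(b), self.remaining)
        b[:n] = bytes(n)
        self.remaining -= n
        return n

gen = io.BufferedReader(ZeroGenerator(1 << 20))  # 1 MiB for a quick check
data = gen.read()
print(len(data))  # 1048576
```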
You can run the reproduction with pytest --no-cov --run-slow -k extract_high_compression_rate
#434 reduces memory usage, but actual memory usage still needs to be measured with a profiler tool.
Here is the same test case without the #434 fix.
@DoNck could you confirm an improvement?
Hi, thanks for addressing this issue, I'll check this ASAP.
@DoNck You may want to change
https://github.com/miurahr/py7zr/blob/master/tests/test_misc.py#L161-L168
to reproduce your case.
You can run it by pytest --no-cov --run-slow -k extract_high_compression_rate
or PYTEST_ADDOPTS="--run-slow -k high" tox -e py39
I've taken a memory profile with PYTEST_ADDOPTS="--run-slow -k high" tox -e mprof
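The mprof tox environment above is py7zr's own profiling setup. As a rough standalone alternative, the standard library's tracemalloc can confirm that bounded-chunk decompression keeps peak tracked memory well below the decompressed size (an illustrative sketch, not the project's profiling harness):

```python
import lzma
import tracemalloc

# 32 MiB of zeros, compressed to a few KiB -- a miniature of the reported case.
compressed = lzma.compress(bytes(32 * 1024 * 1024))

tracemalloc.start()
dec = lzma.LZMADecompressor()
# Decompress in bounded 1 MiB chunks instead of one giant allocation.
total = len(dec.decompress(compressed, max_length=1 << 20))
while not dec.eof:
    total += len(dec.decompress(b"", max_length=1 << 20))
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(total)                    # 33554432: the full output was produced
print(peak < 32 * 1024 * 1024)  # True: peak stayed below the full output size
```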