read/readall dumps the decompressed files to memory, instead of streaming them

Question

jaboja opened this issue 4 months ago · comments

There is a problem with reading large files whose decompressed form exceeds the available RAM:

The library (namely the read/readall methods) first decompresses the whole file into memory using BytesIO, and then returns that BytesIO object. While that may work well for small files, for bigger ones it fails due to lack of memory.

It would be better if the library streamed the files, just like standard file I/O.
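To illustrate what "streaming" means here, below is a minimal sketch that uses only the standard library's lzma module on a plain xz stream. It is not py7zr's API and does not read 7z containers; the function name iter_decompressed and the chunk size are assumptions made for the example. It shows decompressed data being handed out in bounded chunks instead of being collected in one BytesIO.

```python
import io
import lzma


def iter_decompressed(fileobj, chunk_size=64 * 1024):
    """Yield decompressed chunks from an xz stream, at most chunk_size bytes each.

    Hypothetical helper for illustration only; not part of py7zr.
    """
    decompressor = lzma.LZMADecompressor()
    while not decompressor.eof:
        raw = fileobj.read(chunk_size)
        if not raw:
            break  # input ended (possibly truncated); stop instead of looping forever
        # max_length caps how much output a single call may produce
        data = decompressor.decompress(raw, max_length=chunk_size)
        if data:
            yield data
        # Drain output the decompressor is still holding before feeding more input
        while not decompressor.needs_input and not decompressor.eof:
            data = decompressor.decompress(b"", max_length=chunk_size)
            if data:
                yield data


# Usage: only the first chunk is ever decompressed, although the payload is 10 MB
payload = b"0123456789" * 1_000_000
stream = io.BytesIO(lzma.compress(payload))
first_chunk = next(iter_decompressed(stream))
print(first_chunk[:10])
```

The consumer pulls one chunk at a time, so memory use depends on chunk_size rather than on the decompressed size.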

To Reproduce

Download a huge 7z file, e.g. this Wikipedia dump:

wget https://dumps.wikimedia.org/plwiki/20240301/plwiki-20240301-pages-meta-history1.xml-p1p6814.7z

Try to read it:

import py7zr

with py7zr.SevenZipFile('plwiki-20240301-pages-meta-history1.xml-p1p6814.7z', mode='r') as archive:
    # readall() decompresses every file into a BytesIO before returning
    for _name, content in archive.readall().items():
        print(content.read(10))

If the machine has less than 36 GB of memory, the script keeps allocating memory until it runs out, and then it crashes.

Expected behavior
The library should allocate only as much memory as is really needed to read the requested data, and should allow streaming files even if their decompressed form exceeds the available memory and disk space.

Environment:

OS: Ubuntu 22.04.3 LTS
Python 3.10.12
py7zr version: 0.21.0
Disk space: 10 GB
Memory: 2 GB

(The Wikipedia dump file used as an example is 246.6 MB in compressed form and 36 GB when decompressed.)

Hiroshi Miura · Answer 1 · Tue Apr 02 2024 15:18:07 GMT+0800 (China Standard Time)

One of the main loops is in SevenZipFile#_extract, and it looks roughly like this:

for f in self.files:
    # if-else block:
    # if extracting to memory
        _buf = io.BytesIO()
        self.worker.register_filelike(f.id, MemIO(_buf))
    # else (the default)
        self.worker.register_filelike(f.id, outfilename)

# the extraction index is now prepared;
# pass the 7z file pointer and the target path to the worker
self.worker.extract(self.fp, path, parallel=...)

With this structure, the Worker class creates threads and extracts solid blocks in multiple threads when possible.
Implementing streaming would require significant changes to this design. If you come up with an idea for how to improve it, please tell me.
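
One possible direction, sketched under loud assumptions: if the worker only ever calls write() on the object passed to register_filelike (this thread does not confirm that), a write-only sink could forward each chunk to a consumer instead of accumulating it in a BytesIO. ChunkSink and feed below are hypothetical names, and feed merely stands in for the worker; none of this is py7zr code.

```python
import hashlib


class ChunkSink:
    """Hypothetical write-only target that forwards chunks instead of storing them."""

    def __init__(self, on_chunk):
        self._on_chunk = on_chunk  # callable that receives each decompressed chunk
        self.bytes_seen = 0

    def write(self, data):
        self._on_chunk(bytes(data))
        self.bytes_seen += len(data)
        return len(data)

    def flush(self):
        pass

    def close(self):
        pass


def feed(sink, chunks):
    """Stand-in for the extraction worker: pushes decompressed chunks into the sink."""
    for chunk in chunks:
        sink.write(chunk)


# Usage: hash the content without ever holding the whole file in memory
digest = hashlib.sha256()
sink = ChunkSink(digest.update)
feed(sink, [b"hello ", b"world"])
print(sink.bytes_seen, digest.hexdigest())
```

This is a push model: the worker drives the consumer, so it would not by itself give callers a content.read(10) style object. A pull-style reader on top of a threaded worker would likely need something like a bounded queue between the two, which is part of why the change is significant.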

py7zr originally extracted files only to the file system; @Zoynels contributed the memory I/O feature in #111.