miurahr / py7zr

7zip in python3 with ZStandard, PPMd, LZMA2, LZMA1, Delta, BCJ, BZip2, and Deflate compressions, and AES encryption.

Home Page:https://pypi.org/project/py7zr/

read/readall dumps the decompressed files to memory, instead of streaming them

jaboja opened this issue

There is a problem with reading large files whose decompressed form exceeds the available RAM:

The library (namely the read/readall methods) first decompresses each file into memory using BytesIO and then returns that BytesIO object. That works for small files, but for bigger ones it fails for lack of memory.

It would be better if the library streamed the files, just like the standard file IO.

To Reproduce

  1. Download a huge 7z file, e.g. this Wikipedia dump:
     wget https://dumps.wikimedia.org/plwiki/20240301/plwiki-20240301-pages-meta-history1.xml-p1p6814.7z
  2. Try to read it:
     import py7zr
     archive = py7zr.SevenZipFile('plwiki-20240301-pages-meta-history1.xml-p1p6814.7z', mode='r')
     for _ignore, content in archive.readall().items():
         print(content.read(10))
  3. If the machine has less than 36 GB of memory, the script keeps allocating memory until it runs out, then crashes.

Expected behavior
The library should allocate only as much memory as is needed to read the requested data, and should allow streaming files even when their decompressed form exceeds the available memory and disk space.
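To illustrate the kind of behavior meant here, below is a minimal sketch using Python's standard lzma module rather than py7zr. It is not a 7z reader; it only shows decompressed data being consumed in fixed-size chunks, so peak memory is bounded by the chunk size instead of the decompressed size. The helper name iter_decompressed and the 64 KiB chunk size are illustrative choices.

```python
import io
import lzma


def iter_decompressed(fileobj, chunk_size=64 * 1024):
    """Yield decompressed data in pieces of at most chunk_size bytes."""
    # lzma.open accepts an already-open binary file object.
    with lzma.open(fileobj, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            yield chunk


# Demo input: 8 MiB of highly compressible data, compressed in memory.
payload = b"wiki" * (2 * 1024 * 1024)
compressed = io.BytesIO(lzma.compress(payload))

total = 0
largest = 0
for chunk in iter_decompressed(compressed):
    # Only one chunk of decompressed data is held at a time.
    total += len(chunk)
    largest = max(largest, len(chunk))

print(total, largest)
```

With an interface like this, the first 10 bytes of a 36 GB member could be read without ever materializing the rest.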

Environment:

  • OS: Ubuntu 22.04.3 LTS
  • Python 3.10.12
  • py7zr version: 0.21.0
  • Disk space: 10 GB
  • Memory: 2 GB

(the Wikipedia dump file used as an example is 246.6 MB in compressed form, and 36 GB when decompressed)

One of the main loops in SevenZipFile#_extract looks roughly like this:

        for f in self.files:
            # if/else block (shown schematically)
            # if extracting to memory:
                _buf = io.BytesIO()
                self.worker.register_filelike(f.id, MemIO(_buf))
            # else (default, extracting to the file system):
                self.worker.register_filelike(f.id, outfilename)

        # the extraction index is now prepared;
        # pass the 7z file pointer and the target path to the worker
        self.worker.extract(self.fp, path, parallel=...)

With this structure, the Worker class creates threads and extracts solid blocks in parallel when possible.
Implementing streaming would require changing this significantly. If you have an idea how to improve it, please tell me.
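One possible direction, sketched here with the standard library only: register a write-only file-like object backed by a bounded queue, so the extraction thread blocks once the consumer falls behind and memory stays capped. QueueSink and fake_worker are hypothetical names; fake_worker merely stands in for the extraction thread, and it is an assumption that the worker needs nothing beyond write() from a registered object.

```python
import queue
import threading


class QueueSink:
    """Hypothetical write-only file-like object backed by a bounded queue."""

    _EOF = object()  # sentinel marking the end of the stream

    def __init__(self, max_chunks=4):
        # At most max_chunks chunks are buffered at any time.
        self._q = queue.Queue(maxsize=max_chunks)

    def write(self, data):
        # Blocks while the queue is full, which throttles the producer.
        self._q.put(bytes(data))
        return len(data)

    def close(self):
        self._q.put(self._EOF)

    def __iter__(self):
        while True:
            item = self._q.get()
            if item is self._EOF:
                return
            yield item


def fake_worker(sink, n_chunks, chunk):
    # Stand-in for the extraction thread: writes chunks, then closes.
    for _ in range(n_chunks):
        sink.write(chunk)
    sink.close()


sink = QueueSink(max_chunks=4)
producer = threading.Thread(target=fake_worker, args=(sink, 100, b"x" * 1024))
producer.start()
received = sum(len(c) for c in sink)  # consume in the main thread
producer.join()
print(received)
```

A caveat with this design: files inside one solid block are decompressed sequentially, so a consumer would have to drain the sinks in archive order, otherwise the producer blocks on a sink nobody is reading.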

py7zr originally extracted files only to the file system; @Zoynels contributed the memory IO feature in #111.