read/readall dumps the decompressed files to memory, instead of streaming them
jaboja opened this issue
There is a problem with reading large files whose decompressed form exceeds the available RAM:
The library (namely the read/readall methods) first decompresses the file into memory using BytesIO and then returns that BytesIO object. While that works well for small files, it fails for bigger ones because memory runs out.
It would be better if the library streamed the files, just like the standard file IO.
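To illustrate what "streamed like the standard file IO" means here, below is a minimal sketch that uses only the standard library's lzma module, not py7zr. The helper names (iter_decompressed, make_sample) and the chunk size are made up for this example. It reads the decompressed data in fixed-size chunks, so memory use stays bounded by the chunk size rather than by the decompressed size.

```python
import lzma
import os
import tempfile

CHUNK_SIZE = 64 * 1024  # upper bound on decompressed bytes held at once


def iter_decompressed(path, chunk_size=CHUNK_SIZE):
    """Yield decompressed chunks one at a time instead of buffering everything."""
    with lzma.open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)  # returns at most chunk_size bytes
            if not chunk:  # empty bytes object means end of stream
                break
            yield chunk


def make_sample(path, size=1_000_000):
    """Write a small .xz file so the sketch can be run without any download."""
    with lzma.open(path, "wb") as out:
        out.write(b"x" * size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.xz")
        make_sample(sample)
        total = 0
        for chunk in iter_decompressed(sample):
            total += len(chunk)  # process the chunk here, then let it go
        print("decompressed bytes seen:", total)
```

The same pattern is what I would expect from read/readall: a file-like object whose read(n) decompresses only as much as is needed to return n bytes.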
To Reproduce
- Download a huge 7z file, e.g. this Wikipedia dump:
wget https://dumps.wikimedia.org/plwiki/20240301/plwiki-20240301-pages-meta-history1.xml-p1p6814.7z
- Try to read it:
import py7zr
archive = py7zr.SevenZipFile('plwiki-20240301-pages-meta-history1.xml-p1p6814.7z', mode='r')
for _ignore, content in archive.readall().items():
    print(content.read(10))
- If the machine has less than 36 GB of memory, the script keeps allocating memory until none is left, and then it crashes.
Expected behavior
The library should allocate only as much memory as is really needed to read the requested data, and it should allow streaming files even if their decompressed form exceeds the available memory and disk space.
Environment:
- OS: Ubuntu 22.04.3 LTS
- Python 3.10.12
- py7zr version: 0.21.0
- Disk space: 10 GB
- Memory: 2 GB
(the Wikipedia dump file used as an example is 246.6 MB in compressed form, and 36 GB when decompressed)
One of the main loops is in SevenZipFile#_extract,
and it looks like this:
for f in self.files:
    # if/else block
    # if extracting to memory:
    _buf = io.BytesIO()
    self.worker.register_filelike(f.id, MemIO(_buf))
    # else (the default):
    self.worker.register_filelike(f.id, outfilename)
# preparation of the extraction index is now finished;
# then call extract with the 7z file pointer and the target path
self.worker.extract(self.fp, path, parallel=...)
With this structure, the Worker class creates threads and extracts solid blocks in multiple threads when possible.
Implementing streaming would require significant changes to this structure. If you have an idea for how to improve it, please tell me.
py7zr originally extracted files only to the file system; @Zoynels contributed the memory IO feature in #111.
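One possible direction, given the register-then-extract structure described above, would be to register a write-only object that hands each decompressed chunk to the caller instead of accumulating it in a BytesIO. The following is only a rough sketch under assumptions: CallbackWriter and decompress_into are hypothetical names, the interface that MemIO or register_filelike actually expects is not shown in this thread, and the standard library's lzma decompressor stands in for the worker.

```python
import hashlib
import lzma


class CallbackWriter:
    """Hypothetical write-only sink: forwards each chunk and stores nothing."""

    def __init__(self, on_chunk):
        self._on_chunk = on_chunk
        self.bytes_written = 0

    def write(self, data):
        self._on_chunk(bytes(data))  # hand the chunk to the consumer
        self.bytes_written += len(data)
        return len(data)

    def flush(self):
        pass  # nothing is buffered, so there is nothing to flush


def decompress_into(compressed, sink, in_chunk=4096, out_chunk=64 * 1024):
    """Stand-in for a worker: decompress .xz bytes into sink in bounded pieces."""
    decomp = lzma.LZMADecompressor()
    pos = 0
    while not decomp.eof:
        if decomp.needs_input:
            piece = compressed[pos:pos + in_chunk]
            pos += len(piece)
            if not piece:
                raise EOFError("compressed data ended early")
        else:
            piece = b""  # the decompressor still holds unread output
        out = decomp.decompress(piece, max_length=out_chunk)
        if out:
            sink.write(out)


if __name__ == "__main__":
    original = b"line of text\n" * 200_000
    digest = hashlib.sha256()
    sink = CallbackWriter(digest.update)  # consume chunks without keeping them
    decompress_into(lzma.compress(original), sink)
    print(sink.bytes_written, digest.hexdigest()[:16])
```

With a sink like this, peak memory is bounded by out_chunk rather than by the decompressed size. The catch is that the callback runs wherever write is called, so with the multi-threaded Worker the consumer would be invoked from a worker thread and would need to be thread-safe.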