miurahr / py7zr

7zip in python3 with ZStandard, PPMd, LZMA2, LZMA1, Delta, BCJ, BZip2, and Deflate compressions, and AES encryption.

Home Page: https://pypi.org/project/py7zr/

Stream-style in-memory compress and decompress functions?

ruiyuanlu opened this issue · comments

Is your feature request related to a problem? Please describe.
Features:
It would be very helpful if in-memory compress and decompress functions were available.

My case:
To save limited network bandwidth, I need to compress binary data and send the compressed bytes to a remote server. The server then decompresses the data and runs some calculations on the fly.

Describe the solution you'd like

  1. When memory is sufficient, compress/decompress all binary data bytes in memory directly:
# memory sufficient case
# local-side
import io
import py7zr
buffer: io.BytesIO = get_buffer_of_all_bytes_from_streams()
buffer.seek(0)
compressed = py7zr.compress(buffer)
send(compressed)

# remote-side
import py7zr
data: bytes = recv()
decompressed = py7zr.decompress(data)  # decompress all data on server
  2. When memory is insufficient, use an iterator:
# memory insufficient case
# local-side
import py7zr
bytesGenerator = get_generator_from_some_streams()
for data in bytesGenerator:
    partial_compressed = py7zr.compress(data)
    send(partial_compressed)

# remote-side
import py7zr
while not stop_flag:
    partial = recv()
    partial_decompressed = py7zr.decompress(partial)  # decompress partial data on server

Describe alternatives you've considered

  1. First, save all binary data bytes in single file, e.g., 'to_be_compressed.bin'.
  2. Second, compress the file as 'compressed.7z'.
  3. Finally send this file to remote server and decompress it.

The problem is that this involves several unnecessary disk I/O operations.

# current method

# local-side
import io
import py7zr

buffer: io.BytesIO = get_buffer_of_all_bytes_from_streams()
buffer.seek(0)

# unnecessary disk write
with open("to_be_compressed.bin", "wb") as f:
    f.write(buffer.getvalue())

# unnecessary disk read and write; better to compress entirely in memory
with py7zr.SevenZipFile("compressed.7z", "w") as z:
    z.write("to_be_compressed.bin", "data")

# unnecessary disk read
send("compressed.7z")

# remote-side
import py7zr
# read from network, and unnecessarily write to disk
# this could be avoided if decompression were done only in memory
with open("compressed-remote.7z", "wb") as f:
    f.write(recv())

# extracting the archive involves an additional disk read; better to decompress entirely in memory
with py7zr.SevenZipFile("compressed-remote.7z", "r") as z:
    data = z.read()['data'].getvalue()

If anyone would like to contribute this, you are welcome.

Thanks for the quick reply.

I just noticed that the compress function here writes bytes into fp. What about letting users pass an io.BytesIO object for this parameter, so that it acts as shared memory from which the compressed bytes can be read?

Similar changes might be made for decompress here.

I think these two small modifications would be enough for the memory-sufficient case.

The writef method of SevenZipFile accepts an io.BytesIO object, and readall also returns io.BytesIO objects.
What improvement are you proposing?

https://github.com/miurahr/py7zr/blob/master/py7zr/py7zr.py#L947

https://github.com/miurahr/py7zr/blob/master/py7zr/py7zr.py#L1019-L1022


    def writef(self, bio: IO[Any], arcname: str):
        if not check_archive_path(arcname):
            raise ValueError(f"Specified path is bad: {arcname}")
        return self._writef(bio, arcname)

Did you know that the 7zip format requires seeking back to the head of the archive (the signature header) during compression?

Did you know that the 7zip format requires seeking back to the head of the archive (the signature header) during compression?

Actually, No.
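
To illustrate why the seek is needed: the 32-byte signature header at the start of a 7z archive records the offset, size, and CRC of the end header, and those values are only known after all the data has been written. The following is a simplified sketch of that layout using only the standard library. The function names are hypothetical, and the "packed streams" and "end header" are stand-in bytes rather than real 7z structures.

```python
import io
import struct
import zlib

MAGIC = b"7z\xbc\xaf\x27\x1c"             # 6-byte 7z signature
START_HEADER = struct.Struct("<QQL")      # next-header offset, size, CRC32 (20 bytes)


def build_signature_header(next_offset: int, next_size: int, next_crc: int) -> bytes:
    """Build the 32-byte header: magic, version, start-header CRC, start header."""
    start = START_HEADER.pack(next_offset, next_size, next_crc)
    version = bytes([0, 4])               # format version 0.4 (assumed here)
    return MAGIC + version + struct.pack("<L", zlib.crc32(start)) + start


def parse_signature_header(header: bytes):
    """Return (next_offset, next_size, next_crc) after validating the header."""
    if len(header) != 32 or header[:6] != MAGIC:
        raise ValueError("not a 7z signature header")
    (start_crc,) = struct.unpack_from("<L", header, 8)
    if zlib.crc32(header[12:32]) != start_crc:
        raise ValueError("start header CRC mismatch")
    return START_HEADER.unpack_from(header, 12)


def write_toy_archive(fp, packed_streams: bytes, end_header: bytes) -> None:
    """Write a toy archive; note the seek back to offset 0 at the end."""
    fp.write(b"\x00" * 32)                # placeholder: real values not known yet
    fp.write(packed_streams)
    fp.write(end_header)
    fp.seek(0)                            # this is why the target must be seekable
    # The offset is counted from the end of the 32-byte signature header.
    fp.write(build_signature_header(len(packed_streams), len(end_header),
                                    zlib.crc32(end_header)))


buf = io.BytesIO()
write_toy_archive(buf, b"x" * 10, b"hdr")
offset, size, crc = parse_signature_header(buf.getvalue()[:32])
```

Because the first 32 bytes are rewritten last, a receiver cannot interpret the archive until the whole thing has arrived, which is what makes send-as-you-compress streaming awkward for this format.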

It looks like compressing and decompressing iteratively is not easy to implement. I tested with writef and it works. Thanks a lot; I'll use py7zr this way and close this issue.
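
For readers who still need the chunk-by-chunk behaviour from the memory-insufficient case, one possible alternative is Python's standard-library lzma module. Note that this is not py7zr and does not produce a 7z archive: it produces a plain .xz stream, which needs no seek back to the start. The function names below are hypothetical, and this is a sketch rather than a drop-in replacement.

```python
import lzma
from typing import Iterable, Iterator


def compress_chunks(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield compressed pieces as input chunks arrive (one .xz stream overall)."""
    compressor = lzma.LZMACompressor()    # default settings: .xz container
    for chunk in chunks:
        out = compressor.compress(chunk)
        if out:                           # the compressor may buffer and return b""
            yield out
    yield compressor.flush()              # emit remaining data and the stream footer


def decompress_chunks(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield decompressed pieces as compressed chunks arrive."""
    decompressor = lzma.LZMADecompressor()
    for chunk in chunks:
        out = decompressor.decompress(chunk)
        if out:
            yield out


# Stand-in for data produced by a generator on the local side.
parts = [b"a" * 1000, b"b" * 1000, b"c" * 10]
wire = list(compress_chunks(parts))       # each item could be passed to send()
restored = b"".join(decompress_chunks(wire))
```

The trade-off is that the result is a single compressed stream without file names or other archive metadata, so it fits the "compress bytes, send, decompress" use case but not general archiving.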