Stream-style in-memory compress and decompress functions?
ruiyuanlu opened this issue · comments
Is your feature request related to a problem? Please describe.
Features:
It would be very helpful if in-memory compress and decompress functions were available.
My case:
To save limited network bandwidth, I need to compress binary data and then send the compressed bytes to a remote server. The server decompresses the data and performs calculations on the fly.
Describe the solution you'd like
- When memory is sufficient, compress/decompress all binary data bytes in memory directly:
# memory sufficient case
# local-side
import io
import py7zr
buffer: io.BytesIO = get_buffer_of_all_bytes_from_streams()
buffer.seek(0)
compressed = py7zr.compress(buffer)
send(compressed)
# remote-side
import py7zr
data: bytes = recv()
decompressed = py7zr.decompress(data) # decompress all data on server
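Note that `py7zr.compress` and `py7zr.decompress` are the proposed API, not existing functions. The shape of the memory-sufficient case can be sketched today with the stdlib `lzma` module (LZMA is 7-Zip's default filter); the payload below stands in for `get_buffer_of_all_bytes_from_streams()`:

```python
import io
import lzma

# Stand-in for the proposed one-shot py7zr.compress/decompress API,
# using stdlib lzma. The BytesIO fakes the buffer gathered from streams.
buffer = io.BytesIO(b"binary payload " * 1000)
buffer.seek(0)

compressed = lzma.compress(buffer.read())    # local side: compress in memory
decompressed = lzma.decompress(compressed)   # remote side: decompress in memory

assert decompressed == b"binary payload " * 1000
```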
- When memory is insufficient, use an iterator:
# memory insufficient case
# local-side
import py7zr
bytesGenerator = get_generator_from_some_streams()
for data in bytesGenerator:
    partial_compressed = py7zr.compress(data)
    send(partial_compressed)
# remote-side
import py7zr
while not stop_flag:
    partial = recv()
    partial_decompressed = py7zr.decompress(partial)  # decompress partial data on server
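The iterative case, too, can be sketched with the stdlib: `lzma.LZMACompressor` and `lzma.LZMADecompressor` process chunks incrementally without holding all data in memory. The chunk generator below stands in for `get_generator_from_some_streams()`, and the `wire` list stands in for `send()`/`recv()`:

```python
import lzma

# Incremental counterpart of the iterator case above.
chunks = (bytes([i]) * 4096 for i in range(32))  # fake data generator

comp = lzma.LZMACompressor()
wire = []                       # stands in for the network transfer
for chunk in chunks:
    wire.append(comp.compress(chunk))  # may return b"" while buffering
wire.append(comp.flush())       # must flush to finalize the stream

decomp = lzma.LZMADecompressor()
restored = b"".join(decomp.decompress(part) for part in wire)
assert restored == b"".join(bytes([i]) * 4096 for i in range(32))
```

Note that a stateful compressor object is required here: compressing each chunk independently (as `py7zr.compress(data)` per chunk would) produces many small standalone streams rather than one archive.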
Describe alternatives you've considered
- First, save all binary data bytes in single file, e.g., 'to_be_compressed.bin'.
- Second, compress the file as 'compressed.7z'.
- Finally send this file to remote server and decompress it.
The problem is that this involves several rounds of unnecessary disk I/O.
# current method
# local-side
import io
import py7zr
buffer: io.BytesIO = get_buffer_of_all_bytes_from_streams()
buffer.seek(0)
# unnecessary disk write
with open("to_be_compressed.bin", "wb") as f:
    f.write(buffer.getvalue())
# unnecessary disk read and write; better if compression were done only in memory
with py7zr.SevenZipFile("compressed.7z", "w") as z:
    z.write("to_be_compressed.bin", "data")
# unnecessary disk read
send("compressed.7z")
# remote-side
import py7zr
# read from network, and unnecessarily write to disk;
# this could be avoided if decompression were done only in memory
with open("compressed-remote.7z", "wb") as f:
    f.write(recv())
# extracting the archive adds another disk read; better if decompression were done only in memory
with py7zr.SevenZipFile("compressed-remote.7z", "r") as z:
    data = z.read()['data'].getvalue()
If anyone can contribute it, you are welcome to.
Thanks for the quick reply.
Just noticed that the function compress here writes bytes into fp. What about letting users pass an io.BytesIO object for this parameter, as shared memory from which to read the compressed bytes? Similar changes could be made for decompress here.
I think with these two slight modifications, it will be enough for the memory-sufficient case.
The writef method of SevenZipFile accepts an io.BytesIO object, and readall also returns one.
What is your proposal of improvements?
https://github.com/miurahr/py7zr/blob/master/py7zr/py7zr.py#L947
https://github.com/miurahr/py7zr/blob/master/py7zr/py7zr.py#L1019-L1022
def writef(self, bio: IO[Any], arcname: str):
if not check_archive_path(arcname):
raise ValueError(f"Specified path is bad: {arcname}")
return self._writef(bio, arcname)
Do you know that the 7zip format requires a seek to the head of the archive (the signature header) during compression?
- [Signature Header] https://py7zr.readthedocs.io/en/latest/archive_format.html#signature-header
- It has a Next Header CRC. The Next Header is located at the end of the compressed data, and its contents are only decided once all the data has been compressed. After writing the header data at the end of the compressed data, the writer must seek back to the top of the archive and update the CRC.
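This write-placeholder-then-patch pattern is why the archive target must be seekable. A generic sketch of the pattern (this is not the real 7z signature header layout, just the seek behavior it forces on writers):

```python
import io
import struct
import zlib

# Write a CRC placeholder, write the payload, then seek back and patch
# the CRC in place. A BytesIO is seekable, so this works fully in memory.
out = io.BytesIO()
out.write(struct.pack("<I", 0))          # placeholder for the CRC
payload = b"compressed data would go here"
out.write(payload)

crc = zlib.crc32(payload) & 0xFFFFFFFF
out.seek(0)                              # seek back to the archive head...
out.write(struct.pack("<I", crc))        # ...and patch the CRC in place

out.seek(0)
stored_crc, = struct.unpack("<I", out.read(4))
assert stored_crc == zlib.crc32(out.read()) & 0xFFFFFFFF
```

A purely forward, non-seekable pipe cannot support this, which is why a one-pass streaming compress API is hard to fit onto the 7z format.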
> Do you know 7zip format requires seek to head, signature header, of the archive when compression?

Actually, no.
It looks like compressing/decompressing in an iterative way is not easy to implement. I tested with writef and it works. Thanks a lot; I'll use py7zr this way and close this issue.