Stream-style in-memory compress and decompress functions?
ruiyuanlu opened this issue · comments
Is your feature request related to a problem? Please describe.
Features:
It would be very helpful if in-memory compress and decompress functions were available.
My case:
To save limited network bandwidth, I need to compress binary data and then send the compressed bytes to a remote server. The server decompresses the data and performs calculations on the fly.
Describe the solution you'd like
- When memory is sufficient, compress/decompress all binary data bytes in memory directly:
# memory sufficient case
# local-side
import io
import py7zr
buffer: io.BytesIO = get_buffer_of_all_bytes_from_streams()
buffer.seek(0)
compressed = py7zr.compress(buffer)
send(compressed)
# remote-side
import py7zr
data: bytes = recv()
decompressed = py7zr.decompress(data) # decompress all data on server
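Note that `py7zr.compress` and `py7zr.decompress` are the proposed API, not existing functions. The shape of the memory-sufficient case can be sketched today with the stdlib `lzma` module (LZMA is 7-Zip's default filter); the payload below stands in for `get_buffer_of_all_bytes_from_streams()`:

```python
import io
import lzma

# Stand-in for the proposed one-shot py7zr.compress/decompress API,
# using stdlib lzma. The BytesIO fakes the buffer gathered from streams.
buffer = io.BytesIO(b"binary payload " * 1000)
buffer.seek(0)

compressed = lzma.compress(buffer.read())    # local side: compress in memory
decompressed = lzma.decompress(compressed)   # remote side: decompress in memory

assert decompressed == b"binary payload " * 1000
```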
- When memory is insufficient, use an iterator:
# memory insufficient case
# local-side
import py7zr
bytesGenerator = get_generator_from_some_streams()
for data in bytesGenerator:
    partial_compressed = py7zr.compress(data)
    send(partial_compressed)
# remote-side
import py7zr
while not stop_flag:
    partial = recv()
    partial_decompressed = py7zr.decompress(partial)  # decompress partial data on server
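The iterative case, too, can be sketched with the stdlib: `lzma.LZMACompressor` and `lzma.LZMADecompressor` process chunks incrementally without holding all data in memory. The chunk generator below stands in for `get_generator_from_some_streams()`, and the `wire` list stands in for `send()`/`recv()`:

```python
import lzma

# Incremental counterpart of the iterator case above.
chunks = (bytes([i]) * 4096 for i in range(32))  # fake data generator

comp = lzma.LZMACompressor()
wire = []                       # stands in for the network transfer
for chunk in chunks:
    wire.append(comp.compress(chunk))  # may return b"" while buffering
wire.append(comp.flush())       # must flush to finalize the stream

decomp = lzma.LZMADecompressor()
restored = b"".join(decomp.decompress(part) for part in wire)
assert restored == b"".join(bytes([i]) * 4096 for i in range(32))
```

Note that a stateful compressor object is required here: compressing each chunk independently (as `py7zr.compress(data)` per chunk would) produces many small standalone streams rather than one archive.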
Describe alternatives you've considered
- First, save all binary data bytes in single file, e.g., 'to_be_compressed.bin'.
- Second, compress the file as 'compressed.7z'.
- Finally send this file to remote server and decompress it.
The problem is that this involves several rounds of unnecessary disk I/O.
# current method
# local-side
import io
import py7zr
buffer: io.BytesIO = get_buffer_of_all_bytes_from_streams()
buffer.seek(0)
# unnecessary disk write
with open("to_be_compressed.bin", "wb") as f:
    f.write(buffer.getvalue())
# unnecessary disk read and write; better if compression were done only in memory
with py7zr.SevenZipFile("compressed.7z", "w") as z:
    z.write("to_be_compressed.bin", "data")
# unnecessary disk read
send("compressed.7z")
# remote-side
import py7zr
# read from network, and unnecessarily write to disk;
# this could be avoided if decompression were done only in memory
with open("compressed-remote.7z", "wb") as f:
    f.write(recv())
# extracting the archive adds another disk read; better if decompression were done only in memory
with py7zr.SevenZipFile("compressed-remote.7z", "r") as z:
    data = z.read()['data'].getvalue()
If anyone can contribute it, you are welcome to.
Thanks for the quick reply.
Just noticed that the function compress here writes bytes into fp. What about letting users pass an io.BytesIO object for this parameter, as shared memory from which to read the compressed bytes? Similar changes could be made for decompress here.
I think with these two slight modifications, it will be enough for the memory-sufficient case.
The writef method of SevenZipFile accepts an io.BytesIO object, and readall also returns one.
What is your proposal of improvements?
https://github.com/miurahr/py7zr/blob/master/py7zr/py7zr.py#L947
https://github.com/miurahr/py7zr/blob/master/py7zr/py7zr.py#L1019-L1022
def writef(self, bio: IO[Any], arcname: str):
if not check_archive_path(arcname):
raise ValueError(f"Specified path is bad: {arcname}")
return self._writef(bio, arcname)
Do you know that the 7zip format requires a seek to the head of the archive (the signature header) during compression?
- [Signature Header] https://py7zr.readthedocs.io/en/latest/archive_format.html#signature-header
- It has a Next Header CRC. The Next Header is located at the end of the compressed data, and its contents are only decided once all the data has been compressed. After writing the header data at the end of the compressed data, the writer must seek back to the top of the archive and update the CRC.
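This write-placeholder-then-patch pattern is why the archive target must be seekable. A generic sketch of the pattern (this is not the real 7z signature header layout, just the seek behavior it forces on writers):

```python
import io
import struct
import zlib

# Write a CRC placeholder, write the payload, then seek back and patch
# the CRC in place. A BytesIO is seekable, so this works fully in memory.
out = io.BytesIO()
out.write(struct.pack("<I", 0))          # placeholder for the CRC
payload = b"compressed data would go here"
out.write(payload)

crc = zlib.crc32(payload) & 0xFFFFFFFF
out.seek(0)                              # seek back to the archive head...
out.write(struct.pack("<I", crc))        # ...and patch the CRC in place

out.seek(0)
stored_crc, = struct.unpack("<I", out.read(4))
assert stored_crc == zlib.crc32(out.read()) & 0xFFFFFFFF
```

A purely forward, non-seekable pipe cannot support this, which is why a one-pass streaming compress API is hard to fit onto the 7z format.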
> Do you know 7zip format requires seek to head, signature header, of the archive when compression?

Actually, no.
It looks like compressing/decompressing in an iterative way is not easy to implement. I tested with writef and it works. Thanks a lot; I'll use py7zr this way and close this issue.