vorner/pgz

Parallel gzip

The parallel gzip

This is an implementation of a parallel gzip. It works by splitting the input into chunks (32MB by default, but this can be configured). Each chunk is compressed independently and the results are concatenated together. The result can be read and decompressed by the usual gzip implementation.
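The scheme can be illustrated outside the project itself. pgz is written in Rust, but the same idea works with any gzip library; the following is a minimal Python sketch (the `parallel_gzip` helper and `CHUNK_SIZE` constant are illustrative names, not part of pgz):

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 32 * 1024 * 1024  # pgz's default chunk size (configurable)

def parallel_gzip(data: bytes, chunk_size: int = CHUNK_SIZE) -> bytes:
    # Split the input into fixed-size chunks.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Compress each chunk independently; zlib can release the GIL while
    # compressing large buffers, so threads give real parallelism here.
    with ThreadPoolExecutor() as pool:
        members = pool.map(gzip.compress, chunks)
    # Concatenating complete gzip members yields a stream that an
    # ordinary gzip decompressor accepts.
    return b"".join(members)

data = bytes(range(256)) * 4096  # ~1 MB of sample input
out = parallel_gzip(data, chunk_size=64 * 1024)
assert gzip.decompress(out) == data  # standard gzip decodes all members
```

The round trip through plain `gzip.decompress` works because the gzip format explicitly allows multiple concatenated members in one stream.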

The motivation is to speed up transfers of large amounts of data across a fast network through ssh. The ssh throughput is limited by either its compression or its encryption routines, which are single-threaded. Turning compression off in ssh and compressing the stream on multiple cores instead removes that bottleneck. As decompression is much faster than compression, a parallel decompressor is not needed on the receiving side.

Limitations

There are certain limitations:

  • The compressed representation differs slightly from that of the usual sequential gzip. Technically, the output is several concatenated gzip streams, but decompression tools commonly accept that. Furthermore, because the chunks are compressed independently, the compression ratio is likely to be somewhat worse.
  • It uses more memory, since the chunks need to be buffered.
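The ratio penalty can be observed directly: redundancy that spans chunk boundaries cannot be exploited, because each chunk starts compression from scratch. A small Python demonstration (the 20KB chunk size here is chosen only to make the effect obvious, not pgz's default):

```python
import gzip
import random

random.seed(0)
block = bytes(random.randrange(256) for _ in range(20_000))
data = block * 5  # long-range redundancy: one random block repeated

# Whole-stream gzip matches the repeats within its 32 KB window.
whole = gzip.compress(data)

# Chunked compression sees each repeat in isolation, so every chunk
# compresses like fresh random data.
chunks = [data[i:i + 20_000] for i in range(0, len(data), 20_000)]
chunked = b"".join(gzip.compress(c) for c in chunks)

assert gzip.decompress(chunked) == data  # still decodes correctly
assert len(chunked) > len(whole)         # but the ratio is worse
```

With pgz's real 32MB chunks the effect is much smaller, since gzip's match window is only 32 KB anyway; only redundancy near a chunk boundary is lost.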

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT License

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Languages

Language:Rust 100.0%