elastic / rally

Macrobenchmarking framework for Elasticsearch

Add support for zstd-compressed corpora

danielmitterdorfer opened this issue

Rally supports various compression formats such as gz or bzip. It does not support the zstd format, which performed significantly better in both disk usage and decompression speed in my experiments. I compressed a 183GB corpus with pbzip2 and pzstd, each at the maximum compression level supported by the respective tool.

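| Format | Size on disk [bytes] | Size on disk [GB] | Relative size [%] |
|--------|----------------------|-------------------|-------------------|
| bzip   | 18613471805          | 18                | 100               |
| zstd   | 11215205385          | 11                | 60                |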

Decompression is also much faster (times measured with time; the table shows the value of real, i.e. wall-clock time):

| Format | Time to decompress [s] | Relative time [%] |
|--------|------------------------|-------------------|
| bzip   | 388                    | 100               |
| zstd   | 144                    | 36                |

Therefore I propose adding zstd support to Rally, similar to the existing bzip support: the fast path would require pzstd to be on PATH, and a fallback could be based on the Python zstd implementation.

For reference:

  • Compress data: pzstd -19 corpus.json -o corpus.json.zstd (19 denotes the maximum compression level)
  • Decompress data: pzstd -d corpus.json.zstd -o corpus.json
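
The proposed behaviour (pzstd when it is on PATH, a Python fallback otherwise) could be sketched roughly as follows. This is only an illustration, not Rally's actual code: the function names are hypothetical, and the fallback assumes the third-party zstandard package, which the issue does not name explicitly. Whether the fallback handles the multi-frame files that pzstd writes has not been verified here.

```python
import shutil
import subprocess


def pzstd_command(src, dst):
    """Build the decompression command quoted above: pzstd -d <src> -o <dst>."""
    return ["pzstd", "-d", src, "-o", dst]


def decompress_zstd(src, dst, which=shutil.which):
    """Decompress src to dst; return the name of the path that was taken.

    Hypothetical helper: prefers the external pzstd binary and falls back
    to a pure-Python code path if pzstd is not on PATH.
    """
    if which("pzstd") is not None:
        # Fast path: parallel decompression via the external tool.
        subprocess.run(pzstd_command(src, dst), check=True)
        return "pzstd"

    # Fallback (assumption): the third-party "zstandard" package.
    # Imported lazily so that the fast path works without it installed.
    import zstandard

    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # Stream in chunks so a large corpus is never held in memory at once.
        zstandard.ZstdDecompressor().copy_stream(fin, fout)
    return "python"
```

Calling decompress_zstd("corpus.json.zstd", "corpus.json") would then pick whichever path is available. The which parameter is injectable only so that the selection logic can be exercised without pzstd installed.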