JRuby based GZIP function severely hampers performance
brennentsmith opened this issue · comments
- Version:
logstash 5.6.3
- Operating System:
Ubuntu 16.04 LTS (GNU/Linux 4.4.0-97-generic x86_64)
We've found that Logstash, with GZIP decompression enabled, suffers a severe drop in ingestion throughput. Because the GZIP decompression path is single-threaded, one thread sits pegged at 100% while the rest stay idle. VisualVM confirms this: the primary blocking call is GZIP decompression.
This limits our ingestion rate to around ~6000 documents per second, whereas without compression, we are able to achieve ~60,000 to ~80,000 documents per second on the same config/node.
According to the discussion here: http://code.activestate.com/lists/ruby-talk/11168/, the algorithm we use should be comparable to zcat in performance. Benchmarking shows otherwise: each of these files takes about 13 minutes for Logstash to process, while the system gzip handles the same file in under 10 seconds:
time zcat 0_adn_0783_20171117_0056.log.gz | pv -r -l > /dev/null
[ 803k/s]
real 0m8.455s
user 0m7.676s
sys 0m2.800s
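For reference, here is a minimal, self-contained Ruby sketch of the kind of timing we did (the real test used a log file from S3; this generates a small gzip stream in memory instead, so the numbers are only illustrative):

```ruby
require 'zlib'
require 'stringio'
require 'benchmark'

# Build a small in-memory gzip stream standing in for a real log file.
raw = "sample log line\n" * 100_000
gz  = StringIO.new
Zlib::GzipWriter.wrap(gz) { |w| w.write(raw) }

# Time a streaming decompress-and-count pass, analogous to what the
# input plugin does per object.
lines = 0
elapsed = Benchmark.realtime do
  Zlib::GzipReader.wrap(StringIO.new(gz.string)) do |reader|
    reader.each_line { lines += 1 }
  end
end
puts format('%d lines in %.3fs (%.0f lines/s)', lines, elapsed, lines / elapsed)
```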
Currently we have been working around the issue with prefix sharding and a large number of input jobs, but frankly that is an inefficient workaround for a code-level bottleneck.
The question I have is: would you be willing to accept a PR that adds the ability to call out to an external decompressor?
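As a sketch of what such a PR might look like (hypothetical helper, not the plugin's actual API): spawn the system decompressor and stream its stdout, so the decompression CPU time is spent in a separate process rather than on the JRuby pipeline thread. This assumes a `gzip` binary is on the PATH:

```ruby
require 'zlib'
require 'tempfile'

# Hypothetical helper: decompress via an external process (default: the
# system gzip) and yield each decompressed line to the caller.
def each_decompressed_line(path, decompressor: %w[gzip -dc])
  IO.popen(decompressor + [path], 'r') do |io|
    io.each_line { |line| yield line }
  end
end

# Self-contained demo: write a small gzip file, then stream it back.
file = Tempfile.new(['demo', '.log.gz'])
Zlib::GzipWriter.open(file.path) { |w| w.write("a\nb\nc\n") }
count = 0
each_decompressed_line(file.path) { |_line| count += 1 }
puts count
```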
This should be resolved by the merge of #127, which switches to Java's native zlib classes, and is released in v3.2.0. Update with:
bin/logstash-plugin update logstash-input-s3