logstash-plugins / logstash-input-s3

JRuby based GZIP function severely hampers performance

brennentsmith opened this issue · comments

  • Version: logstash 5.6.3
  • Operating System: Ubuntu 16.04 LTS (GNU/Linux 4.4.0-97-generic x86_64)

We've found that enabling GZIP in Logstash severely impacts the performance of the ingestion flow. Because GZIP decompression is single-threaded, a single thread is pegged at 100% while the rest sit idle. This is further confirmed with VisualVM: the primary blocking call is the GZIP decompression.

This limits our ingestion rate to roughly 6,000 documents per second, whereas without compression we achieve ~60,000 to ~80,000 documents per second on the same config/node.

According to the discussion at http://code.activestate.com/lists/ruby-talk/11168/, the algorithm we use should be comparable to zcat in performance. However, benchmarking shows this is nowhere near the case. Each of these files takes about 13 minutes for Logstash to process, whereas with the system gzip we see the following:

time zcat 0_adn_0783_20171117_0056.log.gz | pv -r -l > /dev/null
[ 803k/s]

real	0m8.455s
user	0m7.676s
sys	0m2.800s
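For comparison, the Ruby side can be measured in isolation with something like the following (a rough sketch, assuming Ruby's standard Zlib bindings and the same sample file as above; it only counts lines, so it isolates the decompression cost from the rest of the pipeline):

require 'zlib'
require 'benchmark'

# Count lines from the gzipped sample file using Zlib::GzipReader only,
# so the measurement excludes the S3 download and downstream event processing.
file = '0_adn_0783_20171117_0056.log.gz'
lines = 0
elapsed = Benchmark.realtime do
  Zlib::GzipReader.open(file) do |gz|
    gz.each_line { lines += 1 }
  end
end
puts format('%d lines in %.1fs (%.0f lines/s)', lines, elapsed, lines / elapsed)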

Currently we have been working around the issue through prefix sharding and running a large number of input jobs, but frankly that's an inefficient solution to a code-level bottleneck.

The question I have is: would you be willing to accept a PR which adds the ability to call out to an external decompressor?
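Roughly what I have in mind is piping the file through an external tool instead of Zlib::GzipReader. A minimal sketch (not a patch; the zcat default and the method name are just placeholders):

# Stream a gzipped file through an external decompressor and yield lines.
# Error handling and exit-status checks are omitted in this sketch.
def each_decompressed_line(path, decompressor = 'zcat')
  IO.popen([decompressor, path]) do |io|
    io.each_line { |line| yield line }
  end
end

each_decompressed_line('0_adn_0783_20171117_0056.log.gz') do |line|
  # hand each line to the normal event processing path
end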

This should be resolved with the merging of #127, which uses Java's native zlib libraries and was released in v3.2.0:

bin/logstash-plugin update logstash-input-s3
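For anyone curious what the Java-backed approach looks like from JRuby, here is a rough sketch (illustrative only, not the actual code merged in #127):

require 'java'

# Decompress a gzip file with java.util.zip instead of Ruby's Zlib and yield lines.
# Error handling is omitted; the charset choice is an assumption.
def each_gzip_line_java(path)
  stream = java.util.zip.GZIPInputStream.new(java.io.FileInputStream.new(path))
  reader = java.io.BufferedReader.new(java.io.InputStreamReader.new(stream, 'UTF-8'))
  while (line = reader.readLine)
    yield line
  end
  reader.close
end

each_gzip_line_java('0_adn_0783_20171117_0056.log.gz') { |line| puts line }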