logstash-plugins / logstash-input-s3

Application Load Balancer logs failing halfway through because of gz

ryanbowden opened this issue

We just moved from a Classic Load Balancer to an Application Load Balancer in AWS, and now our logs are failing.

The only thing that has changed is that the new AWS load balancers export their logs as gz files.

What we see happening is that it reads a file, stops partway through, and moves on to the next file; in some cases a file has hundreds of lines and fewer than 50 of them are read.

I got it working by un-gzipping the files and putting them back on S3: the plugin then ran fine and ingested all the logs, so the issue seems to be in the gzip-handling part of the code.

If you need any information on our setup, let us know and I will try to get it for you.

We have the same issue. From a 6 MB gz file, only ~160 lines are read. It runs in logstash:5.1.1-alpine with the default config.

I've been working on debugging this issue, and the problem is the following: the downloaded file contains multiple gzip streams, while the current code expects only one stream per file. The gzip format (RFC 1952) allows a file to consist of several concatenated members, and the ALB log files use that, so stopping after the first stream silently drops the remaining log lines in each file.

This behavior has been there since at least 1.4.2.

To reproduce the issue, first isolate the code that manages the compressed files:

require 'zlib'

Zlib::GzipReader.open("./mutliple_streams.gz", encoding: 'UTF-8') do |decoder|
  puts "one loop"
  decoder.each_line do |line|
    print "  -  EOF?: "
    print decoder.eof?
    print "  -  Line Number: "
    print decoder.lineno
    print "  -  Bytes read: "
    print decoder.tell
    print "\n"
  end
end

We create two files:
file1:

hello 1
hello 2
hello 3
hello 4
hello 5
hello 6
hello 7
hello 8
hello 9
hello 10
hello 11

file2:

more hello 12
more hello 13
more hello 14
more hello 15
more hello 16

We create an archive with multiple streams:

gzip < file1 > mutliple_streams.gz
gzip < file2 >> mutliple_streams.gz
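
For reference, the same multi-member archive can be produced from Ruby. This is only a sketch equivalent to the two shell commands above; each member is written as a complete, self-contained gzip stream:

require 'zlib'

# Append two independent gzip members to the same file, mirroring
# `gzip < file1 > mutliple_streams.gz` and `gzip < file2 >> mutliple_streams.gz`.
File.open("mutliple_streams.gz", "wb") do |file|
  gz = Zlib::GzipWriter.new(file)
  gz.write(File.read("file1"))
  gz.finish # writes the gzip trailer without closing `file`
end

File.open("mutliple_streams.gz", "ab") do |file|
  gz = Zlib::GzipWriter.new(file)
  gz.write(File.read("file2"))
  gz.finish
end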

Testing with zcat returns 16 lines:

gunzip -c mutliple_streams.gz | wc -l
16

The Ruby script only reads the first 11 lines.

The following code seems to work on both MRI and JRuby; I will make a proper PR.

# read from: http://code.activestate.com/lists/ruby-talk/11168/
require 'zlib'

File.open("./mutliple_streams.gz") do |zio|
  while true do
    io = Zlib::GzipReader.new(zio)
    io.each_line do |line|
      puts line
    end
    # #unused returns the bytes GzipReader over-read past the end of the
    # current member, or nil when the whole file has been consumed.
    break if io.unused.nil?
    zio.pos -= io.unused.length # rewind to the start of the next stream
  end
end

I am looking for another idea; #unused will allocate the remainder of the IO's read buffer as a string, which can be big.
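
One alternative, sketched here for discussion only, is to drive Zlib::Inflate directly and restart it at each member boundary, so only the small over-read input buffer is ever retained. The helper name and chunk size below are illustrative, not the plugin's actual code:

require 'zlib'

# Stream a multi-member gzip file without GzipReader#unused.
# MAX_WBITS + 16 tells zlib to parse the gzip wrapper.
def each_decompressed_chunk(path)
  inflater = Zlib::Inflate.new(Zlib::MAX_WBITS + 16)
  File.open(path, "rb") do |io|
    data = io.read(16_384)
    until data.nil? || data.empty?
      yield inflater.inflate(data)
      if inflater.finished?
        # One member is done; the unconsumed input bytes belong to the
        # next member, so feed them to a fresh inflater.
        data = inflater.flush_next_in
        inflater.close
        inflater = Zlib::Inflate.new(Zlib::MAX_WBITS + 16)
        data = io.read(16_384) if data.empty?
      else
        data = io.read(16_384)
      end
    end
  end
ensure
  inflater.close if inflater && !inflater.closed?
end

Calling it on the test archive would look like:

each_decompressed_chunk("./mutliple_streams.gz") { |chunk| print chunk }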

After looking at the JRuby/JCraft zlib implementation, I think it's fine to use #unused.
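
For anyone curious, here is a quick inspection snippet (not part of the fix) showing what #unused holds on the test archive above:

require 'zlib'

File.open("./mutliple_streams.gz", "rb") do |zio|
  gz = Zlib::GzipReader.new(zio)
  gz.read            # consume the first member entirely
  unused = gz.unused # bytes over-read past the first member's trailer
  puts "over-read #{unused ? unused.bytesize : 0} bytes of the next stream"
end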

@ph I tested with your scripts. This is exactly the case. Looking forward to the fix.

@benjah1 I have a PR (#106) with an official fix if you want to test it out; I will try to get it merged ASAP.

@benjah1 @ryanbowden I've published plugin version 3.1.2; you can now update it with bin/logstash-plugin update logstash-input-s3.

Tested. It works perfectly now. Thanks!