logstash-plugins / logstash-input-s3


Constant and frequent s3 plugin restart due to TCP connection failure

jesse-mo-aiven opened this issue

Logstash information:
Please include the following information:

  1. Logstash version (bin/logstash --version): 8.5.3 and 8.3.3
  2. Logstash installation source (e.g. built from source, with a package manager: DEB/RPM, expanded from tar or zip archive, docker): installed via yum from https://artifacts.elastic.co/packages/8.x/yum
  3. How is Logstash being run (e.g. as a service/service manager: systemd, upstart, etc. Via command line, docker/kubernetes): as a systemd service, via systemctl start logstash
  4. How was the Logstash plugin installed: bundle install

OS version (uname -a if on a Unix-like system): Linux 6.0.8-200.fc36.x86_64

Description of the problem including expected versus actual behavior:

We use the s3 input plugin to ingest logs from several S3 buckets in different regions; the buckets range from small to very large.

  1. The plugin keeps restarting itself frequently with the error message "Error: Failed to open TCP connection to bucketA.s3.eu-central-1.amazonaws.com:443 (initialize: name or service not known)" (the bucket name has been masked as "bucketA").

  2. It seems to happen more frequently with large S3 buckets (>1 million logs per day), but it has also happened with small buckets (0-50k logs per day).

  3. With a large bucket, the time needed to iterate over the objects before the first log is processed is extremely long, which makes the log lag significantly; the restart error tends to worsen the situation.

  4. We also confirmed with AWS that DNS and networking are fine. A tcpdump capture from the time frame in which the error appears shows a successful DNS resolution returning a valid IP (see the connectivity-check sketch after this list).
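
To rule out a resolver difference between the OS tools and the JRuby runtime that Logstash uses (the stack trace below points at /usr/share/logstash/vendor/jruby), a minimal connectivity check can be run from that same JRuby. This is a hypothetical sketch, not part of the plugin; the host name is the masked bucket endpoint from the error message:

require 'socket'
require 'net/http'

host = 'bucketA.s3.eu-central-1.amazonaws.com'   # masked bucket endpoint from the error

begin
  # getaddrinfo is the lookup path that raises "name or service not known".
  addrs = Addrinfo.getaddrinfo(host, 443, nil, :STREAM).map(&:ip_address)
  puts "resolved #{host} -> #{addrs.uniq.join(', ')}"

  # Open a TCP/TLS connection to port 443, similar to what the AWS SDK client does.
  Net::HTTP.start(host, 443, use_ssl: true, open_timeout: 5) do |_http|
    puts "TCP connection to #{host}:443 succeeded"
  end
rescue SocketError, Net::OpenTimeout => e
  puts "connectivity check failed: #{e.class}: #{e.message}"
end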

Steps to reproduce:
Please include a minimal but complete recreation of the problem,
including (e.g.) pipeline definition(s), settings, locale, etc. The easier
you make it for us to reproduce it, the more likely it is that somebody will take the
time to look at it.

The issue seems to happen more frequently with multiple pipelines and when the bucket is large.

logstash config:

input {
  s3 {
    region => "eu-central-1"
    bucket => "bucketA"
    prefix => "prefixA"
    type => "typeA"
    sincedb_path => "/var/lib/logstash/plugins/inputs/s3/sincedb_xxxxxx"
    include_object_properties => "true"
  }
}

  1. Set up multiple S3 buckets with logs constantly flowing in, with different bucket sizes (one of our buckets receives up to 1 million logs per day).
  2. Set up multiple pipelines in one Logstash instance to ingest the logs from all the S3 buckets (see the pipelines.yml sketch after this list).
  3. Check logstash-plain.log for the result.
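
For reference, a multi-pipeline setup like ours is declared in Logstash's pipelines.yml (on an RPM install, /etc/logstash/pipelines.yml), with one entry per bucket. The pipeline ids and config paths below are hypothetical placeholders, not our actual values:

- pipeline.id: s3-bucket-a
  path.config: "/etc/logstash/conf.d/s3_bucket_a.conf"
- pipeline.id: s3-bucket-b
  path.config: "/etc/logstash/conf.d/s3_bucket_b.conf"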

Provide logs (if relevant):

Some of the sensitive info has been masked.

Error logs in the logstash:
Jan 06 18:41:40 ip-x.x.x.x.eu-west-1.compute.internal logstash[12172]: Plugin: <LogStash::Inputs::S3 bucket=>"bucketA", include_object_properties=>true, prefix=>"bucketA/path", id=>"x.x.x.xx", region=>"eu-central-1", sincedb_path=>"/var/lib/logstash/plugins/inputs/s3/sincedb_x.x.x.x", type=>"log", enable_metric=>true, codec=><LogStash::Codecs::Plain id=>"plain_x.x.x.x.x", enable_metric=>true, charset=>"UTF-8">, role_session_name=>"logstash", delete=>false, interval=>60, watch_for_new_files=>true, temporary_directory=>"/tmp/logstash", gzip_pattern=>".gz(ip)?$">
Jan 06 18:41:40 ip-x.x.x.x.eu-west-1.compute.internal logstash[12172]: Error: Failed to open TCP connection to bucketA.s3.eu-central-1.amazonaws.com:443 (initialize: name or service not known)
Jan 06 18:41:40 ip-x.x.x.x.eu-west-1.compute.internal logstash[12172]: Exception: Seahorse::Client::NetworkingError
Jan 06 18:41:40 ip-x.x.x.x.eu-west-1.compute.internal logstash[12172]: Stack: /usr/share/logstash/vendor/jruby/lib/ruby/stdlib/net/http.rb:953:in `block in connect'

TCPdump logs for DNS resolution:
18:41:32.394996 IP ip-x.x.x.x.eu-west-1.compute.internal.55806 > 10.100.0.2.domain: 26908+ [1au] A? bucketA.s3.eu-central-1.amazonaws.com. (82)
18:41:32.395041 IP ip-x.x.x.x.eu-west-1.compute.internal.35320 > 10.100.0.2.domain: 21059+ [1au] AAAA? bucketA.s3.eu-central-1.amazonaws.com. (82)
18:41:32.395059 IP ip-x.x.x.x.eu-west-1.compute.internal.55960 > s3-r-w.eu-central-1.amazonaws.com.https: Flags [.], seq 161794:163226, ack 70064, win 778, length 1432
18:41:32.395061 IP ip-x.x.x.x.eu-west-1.compute.internal.55960 > s3-r-w.eu-central-1.amazonaws.com.https: Flags [P.], seq 163226:163647, ack 70064, win 778, length 421
18:41:32.396475 IP 10.100.0.2.domain > ip-x.x.x.x.eu-west-1.compute.internal.55806: 26908 2/0/1 CNAME s3-r-w.eu-central-1.amazonaws.com., A 3.5.138.144 (119)
18:41:32.396537 IP 10.100.0.2.domain > ip-x.x.x.x.eu-west-1.compute.internal.35320: 21059 1/1/1 CNAME s3-r-w.eu-central-1.amazonaws.com. (185)

The IP 3.5.138.144 that was returned is a valid Amazon address, which also has public records on the internet.