logstash-plugins / logstash-input-s3

Add filename field in logstash-input-s3

phirov opened this issue

The object name or filename in an S3 bucket is important information that identifies where each record came from (especially when there are many files in a bucket).

There are many similar requests on Stack Overflow, e.g.

Mounting S3 as a local file system and using logstash-input-file is another way to get the filename information, but this becomes more complicated when running in Docker.

Since logstash-input-file provides a path field, why can't something similar be included in logstash-input-s3?
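For reference, here is a minimal sketch (paths and field names are hypothetical) of how the file input's automatic path field is typically used; this is the kind of per-event source information being requested for S3 keys:

input {
  file {
    # hypothetical mount point for an S3 bucket mounted locally
    path => "/mnt/s3/logs/*.log"
  }
}
filter {
  mutate {
    # the file input sets "path" on every event automatically
    add_field => { "source_file" => "%{path}" }
  }
}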

We need this feature too. We need access to the filenames of the objects synchronized from S3 to Logstash for post-processing. Thank you for your great work!

This will be in version 3.1.3 of the plugin.

Thanks @ph

Hi @ph, I am currently using 3.1.4 but I don't see that as part of the input. Not sure if I am missing anything here.

@jk2l it may be that the name is non-obvious. Do you have [@metadata][s3][key]?

@todd534 nope

here is the output

{
    "@timestamp" => 2017-05-27T12:19:10.033Z,
      "@version" => "1",
       "message" => "21-May-2017 08:55:19 INFO (6): Uploading Log...\n"
}

And here is my config

input {
    s3 {
        region => 'us-east-1'
        bucket => 'mybucket'
        prefix => 'prefix/path/'
    }
    stdin { }
}
output {
  stdout { codec => rubydebug }
}

I am using the logstash:5 Docker image and here is my build:

FROM logstash:5
MAINTAINER Jacky Leung <jacky@fishpond.co.nz>

RUN logstash-plugin update logstash-input-s3
RUN chown -R logstash: /usr/share/logstash/vendor/bundle/jruby/1.9/gems/

which does give me this on build

Step 3/5 : RUN logstash-plugin update logstash-input-s3
 ---> Running in 5b3de24b0d99
Updating logstash-input-s3
Updated logstash-input-s3 3.1.2 to 3.1.4

Also confirmed after logging into the container:

root@06351116c7b5:/# logstash-plugin list --verbose logstash-input-s3
logstash-input-s3 (3.1.4)

BTW, the credentials use AWS_SESSION_TOKEN:

docker run --rm -it \
    -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
    -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
    -e AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
    -v $(pwd)/test/conf.d/s3input.conf:/etc/logstash/conf.d/s3input.conf  \
    <docker image ID> -f /etc/logstash/conf.d/ -i

I am testing it at the moment; this is not a permanent build.

Okay, never mind... I just realized I need to add metadata => true to the rubydebug codec.
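For reference, the output section that exposes the metadata looks like this; the S3 key only shows up in the printed event when metadata => true is set on the rubydebug codec:

output {
  stdout {
    # metadata => true makes [@metadata][s3][key] visible in the printed event
    codec => rubydebug { metadata => true }
  }
}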

Maybe the documentation could use an update; it took me a while to figure this out.

Hello, I have a similar problem. Maybe someone could tell me what I'm doing wrong?

input {
  s3 {
    access_key_id => "XXXXXX"
    secret_access_key => "XXXXXXX"
    region => "eu-west-1"
    prefix => "logs/XXXXX/"
    bucket => "xbucket"
  }
}
filter {
  grok {
    match => ["message", "(?<SmartId:>[a-zA-Z0-9]+) (?<Module:>[a-zA-Z0-9._-]+) %{TIMESTAMP_ISO8601:Datetime} %{LOGLEVEL:Severity} (?<Submodule:>[a-zA-Z0-9._-]+) %{GREEDYDATA:Logmessage}" ]
    add_field => {"receive_date" => "%{@timestamp}"}
    remove_field => "message"
  }
  if "_grokparsefailure" in [tags] {
    grok {
    }
  }
  date {
    match => ["Datetime", "YYYY-MM-dd HH:mm:ss.SSS"]
    target => "@timestamp"
    remove_field => "Datetime"
  }
}
output {
  elasticsearch {
    hosts => 'localhost:9200'
    manage_template => false
    index => 'logstash-%{+YYYY.MM.dd}'
    document_type => '%{[type]}'
  }
}

And I got the same result:

{
    "@timestamp" => 2017-08-7T12:19:10.033Z,
      "@version" => "1",
       "message" => "message text"
}

Is there any way to get more information (file name, file path)?

@Sadovnikov94
I don't extract the timestamp, so I can't comment on that. I'll paste my working config here, with which I retrieve the file key and use part of it as the document_id. An example S3 file key is tcr/39_10263.txt; I extract 10263 as the unique document ID. You can compare it with yours, and I hope it helps you find the problem.

input {
  s3 {
    access_key_id => "{{ .Env.S3_KEY }}"
    secret_access_key => "{{ .Env.S3_SECRETE }}"
    bucket => "{{ .Env.S3_BUCKET }}"
    prefix => "{{ .Env.S3_PREFIX }}"
    interval => 7200
    region => "us-east-1"
  }
}

filter {
  mutate {
    # copy the S3 object key (exposed by the input as metadata) into a regular field
    add_field => {
      "file" => "%{[@metadata][s3][key]}"
    }
  }
  # grok is its own filter plugin, not an option of mutate
  grok {
    match => { "file" => "_%{NUMBER:id}.txt" }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch"]
    index => "file"
    document_id => "%{id}"
    codec => rubydebug {
      metadata => true
    }
  }
}

Note: my Logstash version is 5.4.0. It appears the S3 input plugin bundled with Logstash is not the latest version, so in my Dockerfile I have to manually update it with RUN logstash-plugin update logstash-input-s3.

@eye8 Thanks, It helps!

@eye8 Thanks, it helped me. Just wanted to ask, is there any way to get the source bucket name and backup_bucket name in the filter section?
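One possible approach (a sketch, not confirmed in this thread): since the bucket and backup bucket names are fixed values in the pipeline configuration, they can be stamped onto every event with the input's common add_field option. The field names and the backup bucket below are hypothetical:

input {
  s3 {
    bucket => "xbucket"
    # hypothetical backup bucket
    backup_to_bucket => "xbucket-backup"
    region => "eu-west-1"
    # add_field is a common option available on every input plugin
    add_field => {
      "source_bucket" => "xbucket"
      "backup_bucket" => "xbucket-backup"
    }
  }
}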