logstash-plugins / logstash-input-s3

Add file listing strategy optimized for CloudTrail "AWSLogs" prefixes

jordansissel opened this issue

CloudTrail and similar AWS products dump millions of files into S3. Listing these is increasingly difficult as the ListObjects API on AWS is very slow. For example, listing (not reading!) 2 million objects takes 45 minutes (or longer sometimes) in my tests.

In cases where we have a predictable S3 object path, we may be able to search for interesting objects more quickly using some S3 file listing features:

  • CommonPrefix - lets us do a "list things in this directory" query without listing all objects (see the sketch after this list)
  • AWSLogs prefixes have this format: AWSLogs/{account}/{product}/{region}/{year}/{month}/{day}/...
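
As an illustration, here is a minimal Ruby sketch (assuming the aws-sdk-s3 gem; the bucket name and region are placeholders) of a delimiter-based listing that returns the account-level common prefixes without enumerating every object:

# List "directories" under AWSLogs/ by asking S3 for common prefixes only.
require "aws-sdk-s3"

s3 = Aws::S3::Client.new(region: "us-east-1")

resp = s3.list_objects_v2(
  bucket: "mylogsbucket",
  prefix: "AWSLogs/",
  delimiter: "/"
)

# Each common prefix looks like "AWSLogs/{account}/"
resp.common_prefixes.each { |cp| puts cp.prefix }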

Knowing this, we could implement the following to scan these files more quickly. For each iteration (a sketch follows the list):

  1. ListObjects on AWSLogs/ with a / delimiter and look at the common prefixes; this gives us AWSLogs/{account}
  2. ListObjects on each account to find specific product/log names: AWSLogs/{account}/CloudTrail (for example)
  3. ListObjects on each product to find all region names: AWSLogs/{account}/CloudTrail/{region}
  4. Compute the date range we wish to scan, say "past 48 hours", to give us a list of {year}/{month}/{day}/ prefixes.
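
A rough Ruby sketch of that iteration (illustrative only, not plugin code; it assumes the aws-sdk-s3 gem, a placeholder bucket, and a 48-hour window covered by "today" and "yesterday"):

# Expand AWSLogs/ into a small set of account/product/region/date prefixes,
# then list objects only under those prefixes.
require "aws-sdk-s3"
require "date"

s3     = Aws::S3::Client.new(region: "us-east-1")
bucket = "mylogsbucket"

# List only the common prefixes ("subdirectories") under a prefix.
def child_prefixes(s3, bucket, prefix)
  s3.list_objects_v2(bucket: bucket, prefix: prefix, delimiter: "/")
    .common_prefixes.map(&:prefix)
end

# 1. AWSLogs/{account}/
accounts = child_prefixes(s3, bucket, "AWSLogs/")

# 2. Narrow each account to the product of interest, e.g. CloudTrail.
products = accounts.map { |account| "#{account}CloudTrail/" }

# 3. AWSLogs/{account}/CloudTrail/{region}/
regions = products.flat_map { |product| child_prefixes(s3, bucket, product) }

# 4. Past 48 hours -> {year}/{month}/{day}/ suffixes.
days = [Date.today, Date.today - 1].map { |d| d.strftime("%Y/%m/%d/") }

# Only these prefixes need a full object listing (pagination omitted for brevity).
regions.product(days).each do |region, day|
  s3.list_objects_v2(bucket: bucket, prefix: "#{region}#{day}").contents.each do |object|
    puts object.key
  end
end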

The above allows us to compute all possible paths that satisfy the task "list all logs for {product} for the past {time range}".

Doing so will require listing far fewer objects (not millions) and should be a nicer option.

This should resolve the majority of #128

I wrote a quick script to test the performance. In the sample S3 bucket I have, it takes 7 seconds on average to list all objects in all AWSLogs/.../{CloudTrail,CloudTrail-Digest} prefixes for a single day (in my tests, object counts range from 2500-3100 objects per day across many ).

I wonder if this could be generalized beyond AWSLogs-style patterns, so that you could use wildcards and give hints about how to compute the S3 object paths.

Proposal: Add a new setting awslogs which accepts a hash. When set, the S3 input plugin would look for files with the following pattern:

<prefix>/AWSLogs/<account id>/<product>/<region>/YYYY/MM/DD/...

The setting might look like this:

input {
  s3 {
    bucket => "mylogsbucket"
    awslogs => { "accounts" => 12345, "products" => "CloudTrail" }
  }
}

This would instruct the s3 input to scan the following path for any times within the past 36 hours (to catch daily rollovers on CloudTrail), for each account and product listed:

/AWSLogs/12345/CloudTrail/<region>/YYYY/MM/DD/...
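
A minimal sketch of the date expansion (illustrative only; the account and product values are taken from the example setting above, and the 36-hour window spans at most two calendar days):

# Expand the example awslogs setting into the date portion of the prefixes.
require "date"

accounts = ["12345"]        # from the example setting above
products = ["CloudTrail"]   # from the example setting above

now   = Time.now.utc
since = now - (36 * 3600)   # 36-hour look-back

# Every UTC date touched by the window: "today" and (usually) "yesterday".
day_suffixes = (since.to_date..now.to_date).map { |d| d.strftime("%Y/%m/%d/") }

# Prefixes under which regions are enumerated; each discovered region prefix
# then gets each day suffix appended before the final object listing.
product_prefixes = accounts.product(products).map do |account, product|
  "AWSLogs/#{account}/#{product}/"
end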

Notes:

  • All regions would be fetched
  • YYYY, MM, DD would be populated depending on the current time and would check "today" and "yesterday"
  • Any already-processed files would be ignored (sincedb default behavior)
  • If accounts is not present, all accounts listed under the AWSLogs prefix would be scanned, so a minimal configuration for CloudTrail could be awslogs => { "products" => "CloudTrail" }

CloudTrail delivers files roughly every 5 minutes (randomly across regions, it seems), so we could check for new objects every few minutes.

This solution should work for things like CloudTrail as well as VPC Flow Logs.

Alternate proposal: make the prefix take parameters, like:

prefix => "/AWSLogs/%{account}/%{product}/*/%{+YYYY/MM/dd}/"
prefix_parameters => {
  "account" => [ "account1", "account2" ]
  "product" => "CloudTrail"
}

With this configuration, the plugin would expand all combinations of the parameters and scan each resulting account+product+time prefix.
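
A sketch of how that expansion might work (illustrative only; the %{+YYYY/MM/dd} token and parameter names follow the example above, and single values are normalized to arrays):

# Substitute every combination of prefix_parameters, plus the dates to scan,
# into the prefix template.
require "date"

template   = "/AWSLogs/%{account}/%{product}/*/%{+YYYY/MM/dd}/"
parameters = {
  "account" => ["account1", "account2"],
  "product" => "CloudTrail"
}
dates = [Date.today, Date.today - 1].map { |d| d.strftime("%Y/%m/%d") }

names  = parameters.keys
values = names.map { |n| Array(parameters[n]) }   # normalize single values to arrays

prefixes = values[0].product(*values[1..]).flat_map do |combo|
  dates.map do |date|
    expanded = template.dup
    names.each_with_index { |name, i| expanded = expanded.gsub("%{#{name}}", combo[i]) }
    expanded.gsub("%{+YYYY/MM/dd}", date)
  end
end
# The "*" segment (region) would still be resolved with a delimiter listing.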

Notes:

  • The syntax %{parameter} is also up for discussion. It might be confusing since this syntax is the same as event formatting.
  • Undecided: How to determine how far back to scan? (my default of 2 days still stands, but should this be configurable?)

Just implementing regex would do for most cases, so the last part of the S3 prefix doc can be removed.

I can't find a way in Logstash to push to S3 using a generated prefix (e.g. date, customer, product) and then read it back from S3, because I would need to generate the list of prefixes (I may be wrong, I'm fairly new to Logstash).

Just implementing regex would do for most cases

I am unaware of how to achieve this with the S3 API today. To my knowledge, there's no way to list objects in S3 by regex. Applying a regex on the client (Logstash) would still require a full object listing in order to test each object's path against the regex, I think? Maybe we could be clever about how the regex is processed and turn something like /[a-z]+/[a-z]+ into S3 prefix queries to handle it? I don't know yet. Such a contraption might be too hard to use, I wonder?
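
For contrast, here is a sketch of the naive client-side approach (assuming the aws-sdk-s3 gem; the bucket name and pattern are placeholders). Note that the full paginated listing is still required before any key can be tested:

# Client-side regex filtering: every object under the prefix must still be
# listed (page by page) before the pattern can be applied.
require "aws-sdk-s3"

s3      = Aws::S3::Client.new(region: "us-east-1")
pattern = %r{\AAWSLogs/\d+/CloudTrail/[a-z0-9-]+/}   # example pattern only

matching = []
params   = { bucket: "mylogsbucket", prefix: "AWSLogs/" }
loop do
  resp = s3.list_objects_v2(params)
  matching.concat(resp.contents.map(&:key).grep(pattern))
  break unless resp.is_truncated
  params[:continuation_token] = resp.next_continuation_token
end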

Glancing at open issues, I see now that #86 proposed something similar to this specific to CloudTrail (and similar s3 logs).

@jordansissel any progress on efforts here or alternative approaches you've discovered since this issue was created?

Is there any update on this enhancement?

Is there any update on this? We are seeing significant issues with reading from an S3 bucket containing CloudTrail logs from ~50 separate accounts. Essentially, some files get processed and some don't.