logstash-plugins / logstash-input-s3

Add file listing strategy optimized for CloudTrail "AWSLogs" prefixes

jordansissel opened this issue

CloudTrail and similar AWS products dump millions of files into S3. Listing these is increasingly difficult as the ListObjects API on AWS is very slow. For example, listing (not reading!) 2 million objects takes 45 minutes (or longer sometimes) in my tests.

In cases where we have a predictable S3 object path, we may be able to search for interesting objects more quickly using some S3 file listing features:

  • CommonPrefix - lets us do a "list things in this directory" query without listing all objects (see the sketch after this list)
  • AWSLogs prefixes have this format: AWSLogs/{account}/{product}/{region}/{year}/{month}/{day}/...
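
As an illustration, here is a minimal Ruby sketch (assuming the aws-sdk-s3 gem; the bucket name and region are placeholders) of a delimiter-based listing that returns the account-level common prefixes without enumerating every object:

# List "directories" under AWSLogs/ by asking S3 for common prefixes only.
require "aws-sdk-s3"

s3 = Aws::S3::Client.new(region: "us-east-1")

resp = s3.list_objects_v2(
  bucket: "mylogsbucket",
  prefix: "AWSLogs/",
  delimiter: "/"
)

# Each common prefix looks like "AWSLogs/{account}/"
resp.common_prefixes.each { |cp| puts cp.prefix }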

Knowing this, we could implement the following to scan these files more quickly. For each iteration (a sketch follows the list):

  1. ListObjects on AWSLogs/ with a / delimiter and look at the common prefixes; this gives us AWSLogs/{account}
  2. ListObjects on each account to find specific product/log names: AWSLogs/{account}/CloudTrail (for example)
  3. ListObjects on each product to find all region names: AWSLogs/{account}/CloudTrail/{region}
  4. Compute the date range we wish to scan, say "past 48 hours", to give us a list of {year}/{month}/{day}/ prefixes.
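
A rough Ruby sketch of that iteration (illustrative only, not plugin code; it assumes the aws-sdk-s3 gem, a placeholder bucket, and a 48-hour window covered by "today" and "yesterday"):

# Expand AWSLogs/ into a small set of account/product/region/date prefixes,
# then list objects only under those prefixes.
require "aws-sdk-s3"
require "date"

s3     = Aws::S3::Client.new(region: "us-east-1")
bucket = "mylogsbucket"

# List only the common prefixes ("subdirectories") under a prefix.
def child_prefixes(s3, bucket, prefix)
  s3.list_objects_v2(bucket: bucket, prefix: prefix, delimiter: "/")
    .common_prefixes.map(&:prefix)
end

# 1. AWSLogs/{account}/
accounts = child_prefixes(s3, bucket, "AWSLogs/")

# 2. Narrow each account to the product of interest, e.g. CloudTrail.
products = accounts.map { |account| "#{account}CloudTrail/" }

# 3. AWSLogs/{account}/CloudTrail/{region}/
regions = products.flat_map { |product| child_prefixes(s3, bucket, product) }

# 4. Past 48 hours -> {year}/{month}/{day}/ suffixes.
days = [Date.today, Date.today - 1].map { |d| d.strftime("%Y/%m/%d/") }

# Only these prefixes need a full object listing (pagination omitted for brevity).
regions.product(days).each do |region, day|
  s3.list_objects_v2(bucket: bucket, prefix: "#{region}#{day}").contents.each do |object|
    puts object.key
  end
end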

The above allows us to compute all possible paths that satisfy the task "list all logs for {product} for the past {time range}".

Doing so will require listing far fewer objects (not millions) and should be a nicer option.

This should resolve the majority of #128

I wrote a quick script to test the performance. In the sample S3 bucket I have, it takes 7 seconds on average to list all objects in all AWSLogs/.../{CloudTrail,CloudTrail-Digest} prefixes for a single day (in my tests, object counts range from 2500-3100 objects per day across many ).

I wonder if this could be generalized beyond AWSLogs-style patterns, so that you could use wildcards and give hints about how to compute the S3 object paths.

Proposal: Add a new setting awslogs which accepts a hash. When set, the S3 input plugin would look for files with the following pattern:

<prefix>/AWSLogs/<account id>/<product>/<region>/YYYY/MM/DD/...

The setting might look like this:

input {
  s3 {
    bucket => "mylogsbucket"
    awslogs => { "accounts" => 12345, "products" => "CloudTrail" }
  }
}

This would instruct the s3 input to scan the following path for any times within the past 36 hours (to catch daily rollovers on CloudTrail), for each account and product listed:

/AWSLogs/12345/CloudTrail/<region>/YYYY/MM/DD/...
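
A minimal sketch of the date expansion (illustrative only; the account and product values are taken from the example setting above, and the 36-hour window spans at most two calendar days):

# Expand the example awslogs setting into the date portion of the prefixes.
require "date"

accounts = ["12345"]        # from the example setting above
products = ["CloudTrail"]   # from the example setting above

now   = Time.now.utc
since = now - (36 * 3600)   # 36-hour look-back

# Every UTC date touched by the window: "today" and (usually) "yesterday".
day_suffixes = (since.to_date..now.to_date).map { |d| d.strftime("%Y/%m/%d/") }

# Prefixes under which regions are enumerated; each discovered region prefix
# then gets each day suffix appended before the final object listing.
product_prefixes = accounts.product(products).map do |account, product|
  "AWSLogs/#{account}/#{product}/"
end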

Notes:

  • All regions would be fetched
  • YYYY, MM, DD would be populated depending on the current time and would check "today" and "yesterday"
  • Any already-processed files would be ignored (sincedb default behavior)
  • If accounts is not present, all accounts listed under the AWSLogs prefix would be scanned, so a minimal configuration for CloudTrail could be awslogs => { "products" => "CloudTrail" }

CloudTrail delivers files roughly every 5 minutes (randomly across regions, it seems), so we could check for new objects every few minutes.

This solution should work for things like CloudTrail as well as VPC Flow Logs.

Alternate proposal: make the prefix take parameters, like:

prefix => "/AWSLogs/%{account}/%{product}/*/%{+YYYY/MM/dd}/"
prefix_parameters => {
  "account" => [ "account1", "account2" ]
  "product" => "CloudTrail"
}

With this configuration, the plugin would expand all combinations of the parameters and scan each resulting account+product+time prefix.
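
A sketch of how that expansion might work (illustrative only; the %{+YYYY/MM/dd} token and parameter names follow the example above, and single values are normalized to arrays):

# Substitute every combination of prefix_parameters, plus the dates to scan,
# into the prefix template.
require "date"

template   = "/AWSLogs/%{account}/%{product}/*/%{+YYYY/MM/dd}/"
parameters = {
  "account" => ["account1", "account2"],
  "product" => "CloudTrail"
}
dates = [Date.today, Date.today - 1].map { |d| d.strftime("%Y/%m/%d") }

names  = parameters.keys
values = names.map { |n| Array(parameters[n]) }   # normalize single values to arrays

prefixes = values[0].product(*values[1..]).flat_map do |combo|
  dates.map do |date|
    expanded = template.dup
    names.each_with_index { |name, i| expanded = expanded.gsub("%{#{name}}", combo[i]) }
    expanded.gsub("%{+YYYY/MM/dd}", date)
  end
end
# The "*" segment (region) would still be resolved with a delimiter listing.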

Notes:

  • The syntax %{parameter} is also up for discussion. It might be confusing since this syntax is the same as event formatting.
  • Undecided: How to determine how far back to scan? (my default of 2 days still stands, but should this be configurable?)

Just implementing regex would do for most cases, so the last part of the S3 prefix doc can be removed.

I can't find a way in Logstash to push to S3 using a generated prefix (e.g. date, customer, product) and then read it back from S3, because I would need to generate the list of prefixes (I may be wrong, I'm fairly new to Logstash).

Just implementing regex would do for most cases

I am unaware of how to achieve this with the S3 API today. To my knowledge, there's no way to list objects in S3 by regex. Applying a regex on the client (Logstash) would still require a full object listing in order to test each object's path against the regex, I think? Maybe we could be clever about how the regex is processed and turn something like /[a-z]+/[a-z]+ into S3 prefix queries to handle it? I don't know yet. Such a contraption might be too hard to use, I wonder?
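
For contrast, here is a sketch of the naive client-side approach (assuming the aws-sdk-s3 gem; the bucket name and pattern are placeholders). Note that the full paginated listing is still required before any key can be tested:

# Client-side regex filtering: every object under the prefix must still be
# listed (page by page) before the pattern can be applied.
require "aws-sdk-s3"

s3      = Aws::S3::Client.new(region: "us-east-1")
pattern = %r{\AAWSLogs/\d+/CloudTrail/[a-z0-9-]+/}   # example pattern only

matching = []
params   = { bucket: "mylogsbucket", prefix: "AWSLogs/" }
loop do
  resp = s3.list_objects_v2(params)
  matching.concat(resp.contents.map(&:key).grep(pattern))
  break unless resp.is_truncated
  params[:continuation_token] = resp.next_continuation_token
end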

Glancing at open issues, I see now that #86 proposed something similar to this specific to CloudTrail (and similar s3 logs).

@jordansissel any progress on efforts here or alternative approaches you've discovered since this issue was created?

Is there any update on this enhancement?

Is there any update on this? We are seeing significant issues with reading from an S3 bucket containing CloudTrail logs from ~50 separate accounts. Essentially, some files get processed and some don't.