logstash-plugins / logstash-input-s3

CloudTrail-optimized polling

brandond opened this issue · comments

A very common use case for S3 polling is ingest of CloudTrail logs, which have a fixed key format within a bucket:
/AWSLogs/<AccountId>/CloudTrail/<region>/<YYYY>/<MM>/<DD>/<AccountId>_CloudTrail_<region>_<ISODate>_<random>.json.gz
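To make the key format concrete, here is a minimal sketch of splitting a key into its components. The helper name `parse_cloudtrail_key` and the regular expression are assumptions derived from the format quoted above, not part of the plugin; the character classes for the timestamp and random suffix are guesses that may need loosening.

```ruby
# Hypothetical pattern built from the key format quoted above.
# The leading slash is optional because S3 keys usually omit it.
CLOUDTRAIL_KEY = %r{
  \A/?AWSLogs/(?<account_id>\d+)/CloudTrail/(?<region>[a-z0-9-]+)/
  (?<year>\d{4})/(?<month>\d{2})/(?<day>\d{2})/
  \d+_CloudTrail_[a-z0-9-]+_(?<timestamp>[0-9TZ]+)_(?<random>[A-Za-z0-9]+)
  \.json\.gz\z
}x

# Returns a Hash of the named components, or nil if the key does not match.
def parse_cloudtrail_key(key)
  match = CLOUDTRAIL_KEY.match(key)
  return nil unless match

  match.named_captures
end
```

For a matching key, `parse_cloudtrail_key(key)['region']` gives the region segment, and the year, month, and day captures give the date sub-prefix the object lives under.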

Given this fixed structure, ingest and incremental polling can be optimized, because:

  • Objects will not be rewritten or appended to once created
  • Within a given account and region, only one sub-prefix (the current date) will be written to.

The process would look something like:

  • Walk the prefix tree to build an initial list of /AWSLogs/<AccountId>/CloudTrail/<region>/ prefixes
  • For each prefix in the list, spawn a poller thread:
    • Walk the prefix tree to the first <YYYY>/<MM>/<DD>/ sub-prefix
    • List objects within this prefix, paging through results using max_keys, next_continuation_token, and start_after until no further objects are returned
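    • When no further objects are returned, remove the <DD> token from current_prefix to get parent_prefix and call list_objects_v2({prefix: parent_prefix, delimiter: '/', start_after: current_prefix}); the delimiter is needed because common prefixes are only returned when one is specified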
    • If a new common prefix is returned, update current_prefix and begin listing objects
    • If no new prefix is returned, repeat for <MM> and <YYYY> tokens
    • If no new sub-prefix is discovered, store last object key as start_after and sleep for a period of time
    • Re-start polling loop
  • Periodically check to see if new /AWSLogs/<AccountId>/CloudTrail/<region> prefixes are present and spawn new poller threads as necessary
  • If a poller thread's /AWSLogs/<AccountId>/CloudTrail/<region> prefix disappears, it should terminate.
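The "walk up one token at a time" step is the subtle part of the loop above, so here is a rough sketch of it. To stay runnable without AWS credentials it uses an in-memory array of keys in place of the bucket: `list_sub_prefixes` is a hypothetical stand-in for a `list_objects_v2` call made with a `/` delimiter, and it assumes that only sub-prefixes sorting strictly after `start_after` come back. All function names here are illustrative.

```ruby
# In-memory stand-in for a delimiter-based listing (assumption, not the SDK).
# Returns the immediate sub-prefixes of `prefix` that sort after `start_after`.
def list_sub_prefixes(keys, prefix, start_after = '')
  keys.select { |key| key.start_with?(prefix) }
      .map { |key| key[prefix.length..-1].split('/', 2) }
      .select { |parts| parts.length == 2 } # skip objects directly under prefix
      .map { |parts| "#{prefix}#{parts.first}/" }
      .uniq
      .select { |candidate| candidate > start_after }
      .sort
end

# Descend `levels` levels below `prefix`, always taking the first sub-prefix.
def first_leaf(keys, prefix, levels)
  levels.times do
    prefix = list_sub_prefixes(keys, prefix).first
    return nil if prefix.nil?
  end
  prefix
end

# Find the next <YYYY>/<MM>/<DD>/ prefix after `current`, dropping the day,
# then the month, then the year token until a newer sibling is found.
def next_date_prefix(keys, current)
  child = current
  3.times do |depth|
    parent = child.sub(%r{[^/]+/\z}, '') # drop the last token
    found = list_sub_prefixes(keys, parent, child).first
    # A newer month or year still has to be walked down to its first day.
    return first_leaf(keys, found, depth) if found

    child = parent
  end
  nil # nothing newer yet: the poller would sleep and retry
end
```

A poller thread would call `first_leaf(keys, root_prefix, 3)` once to find its starting day, then call `next_date_prefix` each time a day's listing is exhausted. A `nil` result corresponds to the "store start_after and sleep" step.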

Using the above logic, the lastdb file only needs to persist a small amount of information:

  • List of /AWSLogs/<AccountId>/CloudTrail/<region>/ prefixes with:
    • current_prefix (<YYYY>/<MM>/<DD>/)
    • next_continuation_token (opaque)
    • start_after (last object key processed)
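As an illustration of how small that state is, here is one possible layout, sketched as a JSON round trip. The choice of JSON, the helper names `dump_state` and `load_state`, and the sample values are all assumptions for the sake of the example; the plugin's actual on-disk format may differ.

```ruby
require 'json'

# Hypothetical serialisation of the per-prefix polling state.
def dump_state(state)
  JSON.pretty_generate(state)
end

def load_state(text)
  JSON.parse(text)
end

# One entry per root prefix; field names mirror the list above.
# The account ID and object key are made-up sample values.
EXAMPLE_STATE = {
  'AWSLogs/123456789012/CloudTrail/us-east-1/' => {
    'current_prefix' => '2020/01/05/',
    'next_continuation_token' => nil, # opaque; nil when no page is pending
    'start_after' => 'AWSLogs/123456789012/CloudTrail/us-east-1/2020/01/05/' \
                     '123456789012_CloudTrail_us-east-1_20200105T0000Z_AbCdEf123456.json.gz'
  }
}.freeze
```

Because each root prefix carries only three fields, the file grows with the number of account and region pairs rather than with the number of objects processed.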

I am happy to work on this by writing an optimized poller class that could be selected via a configuration option. Should I fork the current master branch or the WIP threading branch?

ELB logs in S3 follow a similar convention and could easily work with this; they just use a slightly different prefix, /AWSLogs/<accountid>/elasticloadbalancing/<region>/. It would be awesome if this could apply to those as well.
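One way to support both layouts would be to make the service segment a parameter of root-prefix discovery. The sketch below is only an illustration: `root_prefixes` is a hypothetical helper, it works on an in-memory list of keys rather than a real bucket listing, and the sample keys in the usage note are made up.

```ruby
# Hypothetical: collect the distinct /AWSLogs/<account>/<service>/<region>/
# root prefixes for a given service segment from a list of object keys.
def root_prefixes(keys, service)
  pattern = %r{\A(AWSLogs/\d+/#{Regexp.escape(service)}/[a-z0-9-]+/)}
  keys.map { |key| key[pattern, 1] }.compact.uniq.sort
end
```

Calling `root_prefixes(keys, 'CloudTrail')` and `root_prefixes(keys, 'elasticloadbalancing')` on the same bucket listing would then yield two independent sets of poller roots, each of which can be walked with the same date-prefix logic.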