CloudTrail-optimized polling
brandond opened this issue · comments
A very common use case for S3 polling is ingest of CloudTrail logs, which have a fixed key format within a bucket:
/AWSLogs/<AccountId>/CloudTrail/<region>/<YYYY>/<MM>/<DD>/<AccountId>_CloudTrail_<region>_<ISODate>_<random>.json.gz
This fixed structure allows ingest and incremental polling to be optimized, because:
- Objects will not be rewritten or appended to once created
- Within a given account and region, only one sub-prefix (the current date) will be written to.
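To make the key layout concrete, here is a minimal sketch of splitting a key into its parts. The regex is derived only from the format quoted above, and the names `KEY_RE`, `CloudTrailKey`, and `parse_cloudtrail_key` are made up for illustration. It treats the leading slash as optional, on the assumption that keys are normally stored without one.

```python
import re
from typing import NamedTuple, Optional

# Assumed from the key format quoted in this issue; not an official grammar.
KEY_RE = re.compile(
    r"^/?AWSLogs/(?P<account>\d+)/CloudTrail/(?P<region>[^/]+)/"
    r"(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/"
    r"(?P<filename>[^/]+\.json\.gz)$"
)


class CloudTrailKey(NamedTuple):
    account: str
    region: str
    region_prefix: str  # "AWSLogs/<AccountId>/CloudTrail/<region>/"
    date_prefix: str    # "<YYYY>/<MM>/<DD>/"
    filename: str


def parse_cloudtrail_key(key: str) -> Optional[CloudTrailKey]:
    """Return the components of a CloudTrail log key, or None if it does not match."""
    m = KEY_RE.match(key)
    if m is None:
        return None
    region_prefix = f"AWSLogs/{m['account']}/CloudTrail/{m['region']}/"
    date_prefix = f"{m['year']}/{m['month']}/{m['day']}/"
    return CloudTrailKey(m["account"], m["region"], region_prefix, date_prefix, m["filename"])
```

Anything that does not match (for example, other files that happen to sit under the same bucket) comes back as `None`, so a poller could skip it cheaply.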
The process would look something like:
- Walk the prefix tree to build an initial list of `/AWSLogs/<AccountId>/CloudTrail/<region>/` prefixes
- For each prefix in the list, spawn a poller thread:
  - Walk the prefix tree to the first `<YYYY>/<MM>/<DD>/` sub-prefix
  - List objects within this prefix, paging through results using max_keys, next_continuation_token, and start_after until no further objects are returned
  - When no further objects are returned, remove the `<DD>` token from current_prefix and call `list_objects_v2({prefix: parent_prefix, start_after: current_prefix})`
    - If a new common prefix is returned, update current_prefix and begin listing objects
    - If no new prefix is returned, repeat for `<MM>` and `<YYYY>` tokens
  - If no new sub-prefix is discovered, store last object key as start_after and sleep for a period of time
  - Re-start polling loop
- Periodically check to see if new `/AWSLogs/<AccountId>/CloudTrail/<region>` prefixes are present and spawn new poller threads as necessary
- If a poller thread's `/AWSLogs/<AccountId>/CloudTrail/<region>` prefix disappears, it should terminate.
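A rough sketch of the "advance to the next date prefix" step, to show the idea rather than a finished poller. It does not call S3 directly: `list_prefixes` is a hypothetical callable that returns the common prefixes directly under a given prefix, and `make_fake_lister` is an in-memory stand-in so the logic can be exercised offline. The sketch filters siblings with `>` itself instead of assuming how the service treats start_after when common prefixes are involved.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical: returns the common prefixes (each ending in "/") directly
# under `prefix`. A real version might wrap a delimiter-based listing call.
ListPrefixes = Callable[[str], List[str]]


def first_leaf(list_prefixes: ListPrefixes, prefix: str, depth: int) -> Optional[str]:
    """Descend `depth` levels below `prefix`, always taking the smallest child."""
    for _ in range(depth):
        children = list_prefixes(prefix)
        if not children:
            return None
        prefix = min(children)
    return prefix


def next_date_prefix(
    list_prefixes: ListPrefixes, region_prefix: str, current_prefix: str
) -> Optional[str]:
    """Return the next <YYYY>/<MM>/<DD>/ prefix after current_prefix, or None.

    Strips <DD>, then <MM>, then <YYYY> from current_prefix and looks at each
    level for a sibling that sorts after the one already processed.
    """
    parts = current_prefix[len(region_prefix):].strip("/").split("/")  # [YYYY, MM, DD]
    for level in range(len(parts) - 1, -1, -1):
        parent = region_prefix + "".join(p + "/" for p in parts[:level])
        done = parent + parts[level] + "/"
        newer = sorted(p for p in list_prefixes(parent) if p > done)
        if newer:
            # Walk back down to day level under the first newer sibling.
            return first_leaf(list_prefixes, newer[0], len(parts) - 1 - level)
    return None  # nothing newer yet: caller stores start_after and sleeps


def make_fake_lister(keys: Iterable[str]) -> ListPrefixes:
    """In-memory stand-in for a bucket, for illustration and testing only."""
    keys = list(keys)

    def list_prefixes(prefix: str) -> List[str]:
        found = set()
        for key in keys:
            if key.startswith(prefix):
                rest = key[len(prefix):]
                if "/" in rest:
                    found.add(prefix + rest.split("/", 1)[0] + "/")
        return sorted(found)

    return list_prefixes
```

With this shape, a poller thread would call `first_leaf(list_prefixes, region_prefix, 3)` once at startup and `next_date_prefix(...)` each time the current day's listing runs dry. In the common case that costs one extra listing call per level.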
Using the above logic, the lastdb file only needs to persist a small amount of information:
- List of `/AWSLogs/<AccountId>/CloudTrail/<region>/` prefixes with:
  - current_prefix (`<YYYY>/<MM>/<DD>/`)
  - next_continuation_token (opaque)
  - start_after (last object key processed)
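For a sense of how little state that is, here is a sketch of one possible on-disk shape. The JSON encoding and the names `PollerState`, `dump_state`, and `load_state` are assumptions for illustration only; the existing lastdb format may look nothing like this.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class PollerState:
    current_prefix: str                            # "<YYYY>/<MM>/<DD>/"
    next_continuation_token: Optional[str] = None  # opaque, may be absent
    start_after: Optional[str] = None              # last object key processed


def dump_state(states: Dict[str, PollerState]) -> str:
    """Serialize per-region-prefix state; keys are the region prefixes."""
    return json.dumps({k: asdict(v) for k, v in states.items()}, sort_keys=True)


def load_state(text: str) -> Dict[str, PollerState]:
    """Inverse of dump_state."""
    return {k: PollerState(**v) for k, v in json.loads(text).items()}
```

Each account/region pair costs one small record, regardless of how many objects have already been processed.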
I am happy to work on this with an optimized poller class that could be selected via configuration option. Not sure if I should fork the current master branch, or the WIP threading branch?
ELB logs in S3 follow a similar convention and could easily work with this, just with a slightly different prefix: /AWSLogs/<accountid>/elasticloadbalancing/<region>/
It would be awesome if this could apply to those as well.