logstash-plugins / logstash-input-s3


Millions of S3 objects take a while to list.

jordansissel opened this issue

Some products use S3 to store millions of tiny files, and this causes problems for Logstash when the S3 input is tasked with iterating through all of them.

[screenshot: heap profile]

^^ Above is a heap profile over time of the S3 input listing objects in a bucket. At the time of this screencap, the object count was around 1.8 million, and heap use continues to grow as the plugin works through the whole object list.

I think it's probably inappropriate to expect the S3 input to operate quickly in an environment where there are millions of files and no clear hints for this plugin as to how to expedite object discovery. S3's APIs are not designed for the kind of searching we are doing.

Proposal:

I think we can reduce the memory usage by splitting object listing into multiple steps.

Today, the plugin lists the bucket's S3 objects and builds a single in-memory list of every interesting object that needs to be processed.

We can split the file listing into smaller parts by making use of the list-objects continuation-token or start-after parameters, which allow an object listing to be interrupted and resumed.
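
For reference, here is a minimal sketch (not the plugin's code) of how those two parameters are passed to ListObjectsV2 via the aws-sdk-s3 gem; the bucket name, prefix, and key below are placeholders:

```ruby
require "aws-sdk-s3"

s3 = Aws::S3::Client.new(region: "us-east-1")
bucket = "example-bucket"   # placeholder bucket name

# First page of results (the API returns at most 1,000 keys per call).
page = s3.list_objects_v2(bucket: bucket, prefix: "AWSLogs/")

# continuation-token: resume exactly where the previous page ended.
if page.is_truncated
  next_page = s3.list_objects_v2(
    bucket: bucket,
    prefix: "AWSLogs/",
    continuation_token: page.next_continuation_token
  )
end

# start-after: skip everything up to (and including) a known key.
resumed = s3.list_objects_v2(
  bucket: bucket,
  start_after: "AWSLogs/some-previously-processed-key"
)
```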

Rough proposal:

  1. Scan bucket starting at beginning
  2. After listing up to 10,000 objects, consider them for processing. Note the continuation-token.
  3. (unchanged step) Process any interesting objects (ones needing to be downloaded/processed)
  4. Scan the bucket starting from the continuation-token and go to step 2.

This would reduce heap usage by limiting in-flight object processing to 10,000 objects at a time instead of the unbounded list the plugin builds today.
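
A rough sketch of what steps 1 through 4 could look like with the aws-sdk-s3 client; the 10,000 batch limit, bucket/prefix names, and the "interesting" filter in the block are placeholders, not the plugin's actual code:

```ruby
require "aws-sdk-s3"

BATCH_LIMIT = 10_000   # cap on objects considered per cycle (step 2)

# Yields batches of up to BATCH_LIMIT object summaries, resuming each
# listing pass from the continuation-token noted at the end of the last one.
def each_batch(s3, bucket, prefix)
  token = nil
  loop do
    batch = []
    loop do
      params = { bucket: bucket, prefix: prefix }
      params[:continuation_token] = token if token
      resp = s3.list_objects_v2(params)

      batch.concat(resp.contents)
      token = resp.next_continuation_token
      break unless resp.is_truncated        # reached the end of the bucket
      break if batch.size >= BATCH_LIMIT    # batch full; token marks our place
    end

    yield batch                             # step 3: process this batch
    break if token.nil?                     # nothing left to list; otherwise step 4
  end
end

s3 = Aws::S3::Client.new(region: "us-east-1")
each_batch(s3, "example-bucket", "AWSLogs/") do |objects|
  interesting = objects.reject { |o| o.size.zero? }   # placeholder "interesting" filter
  interesting.each { |o| puts o.key }                 # stand-in for download/process
end
```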

Longer term, and out of scope: add "listing" strategies that make object listing efficient for known key patterns, such as CloudTrail, which uses AWSLogs/{account}/CloudTrail/{region}/{year}/{month}/{day}/ as a prefix.
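
A very rough illustration of that idea, using made-up account, region, and dates: knowing the date-partitioned layout lets the plugin list one day's prefix at a time instead of walking the entire bucket.

```ruby
require "aws-sdk-s3"
require "date"

s3 = Aws::S3::Client.new(region: "us-east-1")

# List one CloudTrail day-prefix at a time (paginate within a day as above
# if it holds more than 1,000 objects).
(Date.new(2019, 1, 1)..Date.new(2019, 1, 7)).each do |day|
  prefix = format(
    "AWSLogs/%s/CloudTrail/%s/%04d/%02d/%02d/",
    "123456789012",   # placeholder account id
    "us-east-1",      # placeholder region
    day.year, day.month, day.day
  )
  resp = s3.list_objects_v2(bucket: "example-trail-bucket", prefix: prefix)
  resp.contents.each { |o| puts o.key }
end
```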