bahrmichael / aws-scheduler

A scheduler for large volumes of time-precise events


Question: why set TTL to 10 mins AFTER scheduled time?

theburningmonk opened this issue

I'm looking at this line of code and it's not making sense to me. Why set the TTL to 10 mins AFTER the scheduled time? DynamoDB TTL doesn't delete items right away (when it's not a busy table, this can be anything between 10-30 mins after an item has expired), so why would you add even more delay to that process?

tl;dr: There used to be an update after TTL was reached.

There was some logic in the emitter which changed an item's status once it was emitted. I wanted to make sure that the DDB records were still available when being emitted. The docs said that deletion might take longer than the TTL timestamp (and all my tests confirmed that), but I didn't want to risk an error when updating an item that had already been deleted.

As the code evolved, I moved away from updating the long-term storage table towards using logs for analysis/recovery. Right now I don't see any reason why the +10m on the TTL would still be required.
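For illustration, a minimal sketch of how the TTL attribute could be written without the old +10m buffer. Table and attribute names here ("scheduled-events", "ttl", "payload") are assumptions, not the repo's exact code.

```python
import time
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("scheduled-events")  # placeholder table name

def put_event(event_id: str, scheduled_at_epoch: int, payload: str) -> None:
    # Previously the TTL was scheduled_at_epoch + 10 * 60 so the emitter could still
    # update the item after emission; with that update gone, the buffer can be dropped.
    table.put_item(
        Item={
            "id": event_id,
            "scheduled_at": scheduled_at_epoch,
            "ttl": scheduled_at_epoch,  # epoch seconds; DynamoDB deletes some time after this
            "payload": payload,
        }
    )

put_event("evt-1", int(time.time()) + 3600, '{"hello": "world"}')
```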

ah, so this is not the scheduled task itself, but another "update" task?

There was an update in an older version, but it was just a simple database write, not a task like the scheduled events.

With the latest version you should be able to remove the +10m on the TTL.

Also I haven't heard of anyone using this in production (apart from my own hobby projects). If you're seriously considering using this code, please let me know :)

I was thinking about it for a side project. I was actually going to ping @theburningmonk about his thoughts on it, too, as the AWS landscape changes. 😁

To be honest, I'd suggest against using this solution if you don't really need it. Here are a couple of simpler options (far fewer moving parts):

  • If you can tolerate delays of 30-60 minutes: DynamoDB TTL
  • If your tasks wait up to 15 minutes: SQS with DelaySeconds
  • If your tasks wait up to 1 year: Step Functions

Only if none of those make sense should you try to run this application :)
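For reference, a minimal sketch of the SQS option from the list above; the queue URL and function name are placeholders, not part of this repo.

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/delayed-tasks"  # placeholder

def schedule_within_15_minutes(payload: dict, delay_seconds: int) -> None:
    # SQS DelaySeconds supports at most 900 seconds (15 minutes)
    if not 0 <= delay_seconds <= 900:
        raise ValueError("SQS DelaySeconds only supports 0-900 seconds")
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(payload),
        DelaySeconds=delay_seconds,
    )

schedule_within_15_minutes({"task": "send-reminder"}, delay_seconds=600)
```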

@bahrmichael I do like the potential of DynamoDB TTL + step functions but curious what the reason would be to hold off on using this implementation. [edit]Other than the moving parts that is...[/edit]

My reason is that Step Functions has far more engineers supporting it and keeping it stable than this solution does :)

tl;dr: 🤔 there is that... 😄 point noted.

This repo is well thought out with metrics, exception handling, throttling, pagination, etc. Should the backing services change, yes that could be an issue. As your code is quite clear, it would presumably not be a huge issue. I'll go convince myself to do it the ☝️ other ways first. Thanks again for sharing the code and the writeup!

It's actually a challenge to find a nice clean way to do this with stock AWS components. What I'll try (sketched below) is:

  • create an S3 object with the payload
  • listen for the S3 object-created event
  • examine the expiration
  • if < 15 min, publish to SQS
  • if > 15 min and < 48 h, start a Step Functions execution (48 h being roughly the maximum actual delete time for DynamoDB TTL)
  • if > 48 h, write to DynamoDB with a TTL and pick up the actual delete with a Step Functions execution
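Very roughly, the routing Lambda for those steps might look like this. Every name here (queue URL, state machine ARN, table name, the "expires_at" field) is a placeholder assumption, not a working implementation.

```python
import json
import time
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
sfn = boto3.client("stepfunctions")
table = boto3.resource("dynamodb").Table("long-term-schedule")  # placeholder

QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/short-delays"  # placeholder
STATE_MACHINE_ARN = "arn:aws:states:eu-west-1:123456789012:stateMachine:wait-and-emit"  # placeholder

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

        delay = body["expires_at"] - int(time.time())
        if delay < 15 * 60:
            # short waits: SQS DelaySeconds (max 900 seconds)
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(body),
                             DelaySeconds=max(delay, 0))
        elif delay < 48 * 3600:
            # medium waits: a Step Functions execution with a Wait state
            sfn.start_execution(stateMachineArn=STATE_MACHINE_ARN, input=json.dumps(body))
        else:
            # long waits: park in DynamoDB with a TTL; the actual delete gets picked up later
            table.put_item(Item={"id": key, "ttl": body["expires_at"], "payload": body})
```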

The hope is to

  • have a simple automatic backup with a sort of audit trail - no data lost, and (sort of) queryable with an S3 list
  • reduce request/retry/throttling logic
  • use S3 writes to avoid DynamoDB write throttling (3,500 writes per second per bucket prefix, with automatic retries!)
  • avoid querying for expiring objects
  • keep precision

Anything simpler that can keep all the gains of the various methods?

I like that. One of my challenges was loading data from DDB in very spiky situations. A cron job that runs every minute can only do so much.

Another idea: S3 has expiration dates as well, which are processed once a day according to this blog post. But imho DDB with TTL is good enough.

Ooooh they bumped the rule limit to 1000. I missed that bit. Just have to stash things with the right prefix.
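A sketch of what a prefix-per-delay lifecycle rule could look like; the bucket name and prefixes are placeholders. Keep in mind lifecycle expiration only runs roughly once a day, so this only suits coarse-grained schedules.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-scheduler-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-day-1",
                "Filter": {"Prefix": "delay-days/1/"},  # stash objects under a per-delay prefix
                "Status": "Enabled",
                "Expiration": {"Days": 1},
            },
            # ...up to 1,000 rules per bucket, one per prefix/delay bucket
        ]
    },
)
```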

I'm essentially seeing if I can avoid the query/paginate/throttling aspects and let the built-in systems do what they do best. I can program try/catch style for unexpected cases... like the time the cleanup got backlogged...