Betterment / delayed

a multi-threaded, SQL-driven ActiveJob backend used at Betterment to process millions of background jobs per day

Clarification on idempotency requirement

stesen-wefunder opened this issue · comments

Hi! This project looks very exciting for teams looking to migrate away from delayed_job. However, I find myself having some lingering questions about the idempotency requirement of this gem:

  • Rails enqueues a lot of built-in jobs to ActiveJob. Stuff like ActiveStorage and ActionMailer. Are all these built-in jobs already idempotent? Is there something that needs to be done to make them idempotent? Or can delayed not be used with these jobs?
  • Under what circumstances do successful jobs run twice? And what's the time horizon on this? Can a job only be picked up a second time before it's completed? Or is there a window of time after some worker has fully completed the job that it can be picked up again? (This question is motivated by the idea of using Redis as an external locking mechanism to aid with idempotency with e.g., jobs that make external service calls. If a job releases the lock in Redis in after_perform would there still be a race condition?)

Massive thanks to the team at Betterment for this project!

Hi @stesen-wefunder! Thanks for reaching out!

These are some great questions -- exactly the kinds that I'd hoped this library's documentation could prompt! Here's my attempt to provide a few answers:

Are all these built-in jobs already idempotent? Is there something that needs to be done to make them idempotent?

ActiveStorage jobs should be idempotent already, insofar as AnalyzeJobs and variant processing should produce deterministic (or at least functionally equivalent) results every time. ActionMailer jobs, on the other hand, don't seem to go out of their way to stamp any kind of message_id eagerly into the serialized job data, which likely means that they are not idempotent (unless an underlying delivery API gem does something particularly fancy involving patching the deliver_later method, but I'm not aware of any gem that does this).
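
To make the "eagerly stamped" idea concrete, here's a rough sketch (not something Rails or delayed provides) of a mailer-wrapping job that generates its deduplication key at enqueue time, so every retry of the same job sees the same key. `IdempotentMailerJob` and `SentEmail` are made-up names for illustration:

```ruby
# Hypothetical illustration only: generate the key once, at enqueue time, so it is
# serialized into the job data and survives retries.
class IdempotentMailerJob < ApplicationJob
  def self.enqueue(mailer_name, action, params)
    perform_later(mailer_name.to_s, action.to_s, params, SecureRandom.uuid)
  end

  def perform(mailer_name, action, params, idempotency_key)
    # SentEmail is an assumed local dedupe table, keyed by the enqueue-time UUID.
    return if SentEmail.exists?(idempotency_key: idempotency_key)

    mailer_name.constantize.with(params).public_send(action).deliver_now
    SentEmail.create!(idempotency_key: idempotency_key)
  end
end
```

Even then, the dedupe check and the external send aren't transactional with each other, which is the same fundamental gap as the dequeue issue discussed below.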

Or can delayed not be used with these jobs?

I wouldn't say that by any stretch. The real-world impact of an errant double-send is likely very low in practice, and particularly for emails (though really for many jobs in general), at-least-once delivery is still far preferable to at-most-once delivery, even without any idempotency guarantees. So I'd argue that you generally should still use delayed with these kinds of jobs.

In the rare case where idempotency is not possible and you'd rather intervene than chance a double-send, you could set max_attempts to 1. (Erroring where attempts == max_attempts should leave a failed job row in the table until a human intervenes, but see below for caveats.)
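
For reference, a minimal sketch of what that might look like. The per-job-class hook below is an assumption on my part (the setting itself comes from the delayed_job lineage), so double-check the gem's README for the exact configuration surface:

```ruby
# Assumed sketch: a non-idempotent job you'd rather not retry.
class ChargeCardJob < ApplicationJob
  queue_as :payments

  # Don't retry: erroring on the first (and only) attempt leaves the failed job row
  # in the table until a human intervenes, as described above.
  def max_attempts
    1
  end
end
```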

Under what circumstances do successful jobs run twice? And what's the time horizon on this? Can a job only be picked up a second time before it's completed? Or is there a window of time after some worker has fully completed the job that it can be picked up again? (This question is motivated by the idea of using Redis as an external locking mechanism to aid with idempotency with e.g., jobs that make external service calls. If a job releases the lock in Redis in after_perform would there still be a race condition?)

So, to unpack this, there are a couple things worth knowing.

Firstly, the worker's pickup query does safely lock jobs, so you shouldn't need to use any kind of external locking mechanism to prevent double execution. (Plus, I wouldn't think that using an external datastore would provide meaningful sequencing guarantees beyond what delayed is capable of providing with its own DB connection, which benefits from all of its desirable ACID properties.)
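
As a rough illustration of why an external lock shouldn't be needed (this is not delayed's actual pickup code, just the general delayed_job-family pattern of claiming rows with a conditional UPDATE):

```ruby
# Simplified sketch: claiming a job is a single conditional UPDATE against the jobs
# table, so at most one worker can stamp locked_at/locked_by on a given row.
candidate = Delayed::Job
              .where(failed_at: nil, locked_at: nil)
              .where("run_at <= ?", Time.current)
              .order(:priority, :run_at)
              .first

if candidate
  # If another worker already claimed the row, this affects 0 rows and we just move on.
  claimed = Delayed::Job
              .where(id: candidate.id, locked_at: nil)
              .update_all(locked_at: Time.current, locked_by: "pid:#{Process.pid}")
  # work the job only when claimed == 1
end
```

Because the claim happens in the same database as the job row itself, it inherits the transactional guarantees that a Redis-based lock would only be approximating.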

Secondly, jobs have a maximum runtime, after which they will be forcibly stopped. Therefore, jobs that have been locked for longer than the maximum runtime (plus a hardcoded 30 second buffer) can be assumed either to be in a terminal "failed" state (attempts == max_attempts) or to have timed out and become available again for pickup. This results in a very small possibility that a job could succeed but then fail to dequeue properly, resulting in another worker eventually picking it up again. (For example, it's possible for a worker to completely crash in between job completion and dequeue.)¹

Because of this dequeue issue (i.e. the retry/cleanup behavior described above), there is an extremely small chance that a job may be attempted twice, even if its max_attempts is 1, though in practice I've never seen this edge case occur. Nonetheless, tightening up the dequeue guarantees is something I'd like to dig deeper into for a future iteration. In the meantime, I'd suggest keeping an eye on job timeouts (bubbled out as Delayed::WorkerTimeout exceptions) and on worker crashes in general (which we monitor via our deployment infrastructure), as these are the likeliest sources of undesired double-sends for non-erroring jobs.

Footnotes

  1. As far as I'm aware, the only backend that makes an attempt to solve for this is que, with its destroy API that you must call from within the job's run method (allowing the dequeue to be co-transactional with the job's business logic). However, this API is not available if you use que as an ActiveJob backend, and I'm not aware of an ActiveJob-compatible workaround.

Thanks for the detailed response! :D

So it sounds like a job completing twice is actually an exceptionally rare case? More of a "be aware there's a chance this could happen" than an "expect and prepare for this to happen regularly"?

I had kind of misinterpreted things, as if the need for idempotency were a trade-off made to get better performance than delayed_job. Like some race condition had been intentionally accepted.

So in practice, would you say most users could drop delayed in as a replacement for delayed_job and not have too many worries about this? I was preparing for a big effort to go through the whole codebase and try to add some sort of idempotency guard over hundreds of different jobs before we could make the switch (hence looking into easy guards like Redis or perhaps Postgres advisory locks).

P.S., How did you make a footnote in a GitHub comment?? 🤯

More of a "be aware there's a chance this could happen" than an "expect and prepare for this to happen regularly"?

Yes, that's a good way of putting it. Though I'd argue that similar idempotency/retryability strategies will apply if your job raises an exception somewhere in the middle (especially if there are side effects external to the local DB that cannot be rolled-back), which may be a more regular occurrence.
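
For example, the same kind of guard that protects a job retried after a mid-run exception also covers the rare double-pickup case. Here's a hypothetical sketch (PaymentRecord and PaymentsApi are made-up names, not part of delayed):

```ruby
class ChargeCustomerJob < ApplicationJob
  def perform(payment_id)
    payment = PaymentRecord.find(payment_id)
    return if payment.charged? # a re-run after a mid-job failure becomes a no-op

    # A stable idempotency key lets the external service dedupe too, covering the
    # window where we charge successfully but crash before committing the flag.
    PaymentsApi.charge!(amount: payment.amount, idempotency_key: "payment-#{payment.id}")
    payment.update!(charged: true)
  end
end
```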

So in practice, would you say most users could drop delayed in as a replacement for delayed_job and not have too many worries about this?

Yes, that was very much the intent! While the two libraries have drifted a bit in terms of functionality and querying strategy, the default resiliency of delayed is in many ways just a more opinionated set of default configs (e.g. not deleting failed jobs by default).

P.S., How did you make a footnote in a GitHub comment?? 🤯

Instructions here. It's one of my favorite features that they've added in the last couple years! 😄

Wonderful! Thanks for all the info! I'm moving forward with a renewed sense of ease, hahaha.

(Sorry for going quiet for a bit there, haha. I do have one other unrelated question, but I'll open another thread for that.)

I'll let you decide whether to leave this issue open for others to come across, or as a reminder to add to the docs, or you can just close it. Whatever works for you. :)

This was a very helpful conversation. I'm also considering switching from delayed_job to delayed, but I'm concerned about the need for idempotency with delayed.

A few follow-up questions to dig into this a bit deeper, @stesen-wefunder ...

  • Does delayed_job also have an at-least-once delivery guarantee? More importantly, does delayed_job also have the behaviour that, under certain conditions, a job may run more than once, and if so, are these conditions the same ones that delayed has?
  • I'm mostly concerned about introducing new edge-case behaviour wherein a job can finish successfully more than once. Your post indicates that this can occur with the delayed gem. Can it occur with delayed_job too? Your footnote above seems to speak to this, but I want to make sure I understood correctly.
  • If this behaviour can occur with delayed_job too, is it just as likely to occur with delayed as it is with delayed_job? In other words, with delayed, is it as common to have jobs finish successfully more than once as with delayed_job?

Jobs being worked more than once is possible in the default configurations of both delayed_job and delayed. Delayed's opinionated worker draining should in fact reduce the incidence of jobs being worked multiple times vs delayed_job, assuming that jobs never take longer than the recommended 20 minute timeout to complete.

Thanks @jmileham! What about jobs that take longer than 20 minutes? We have jobs that take many hours to run, and it would be a very large amount of work to decompose them into smaller jobs.

You can tune your timeout settings, but delayed's default tunings are designed to:

  • support frequent deploys (including our rule to have no more than two versions of the app 'live' at the same time during deployment)
  • be able to resume work with minimal waste/latency impact

leading us to recommend decomposing work into short-running jobs and setting a short timeout. If you're okay leaving your old workers to drain for longer than your longest-running job, that could work, but you'll be subject to all the usual challenges if a worker dies for any other reason.
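
If you do go the long-running-job route, the relevant knob is the worker's maximum runtime. A minimal sketch, assuming delayed keeps the delayed_job-style Delayed::Worker.max_run_time setting (the source of the 20-minute figure above); verify the exact name and default against the gem's README:

```ruby
# config/initializers/delayed.rb
# Must comfortably exceed your longest-running job; the trade-off is that jobs on a
# crashed or stuck worker won't become eligible for pickup again until this window
# (plus the buffer) has elapsed.
Delayed::Worker.max_run_time = 4.hours
```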