Betterment / delayed

a multi-threaded, SQL-driven ActiveJob backend used at Betterment to process millions of background jobs per day

Support raise_signal_exceptions

DanielStevenLewis opened this issue

https://github.com/Betterment/delayed#migrating-from-delayedjob states that "some configurations, like queue_attributes, exit_on_complete, backend, and raise_signal_exceptions have been removed entirely." I think the lack of raise_signal_exceptions (and the reliance on the behaviour described in https://github.com/Betterment/delayed#running-a-worker-process) could prevent me from suggesting a switch from delayed_job to delayed. Would it be difficult to support raise_signal_exceptions, and are there any concerns with the idea of supporting it?

Can you say more about what your concerns are with delayed's behavior? Delayed prioritizes finishing jobs that have already begun, to the extent possible, before worker shutdown, in an attempt not to waste work and to minimize job latency. It also leans into the assumption that not every job payload will have been implemented with ideal semantic idempotency. In our view, having a more opinionated and curated worker drain/deployment process is an advantage, but we would love to learn more about your context.

We currently use Delayed::Worker.raise_signal_exceptions = :term with delayed_job. I'm hoping that we can switch over to delayed with minimal work/changes needed, and thereby benefit from the performance enhancements it has, as a quick win.
We restart the job servers whenever we deploy (every few days), and we have jobs that take many hours to run. I'm concerned that without this configuration option, after a deployment we'd have jobs that would take a very long time before they can retry.

Thanks for asking, @jmileham. Is there more information I should try to provide to better speak to your question?

So you're looking to switch to delayed but would need to extend the job timeout, and aren't looking to implement a long-lived draining period in your infra coordination right away? Makes sense. I'll tag out now because @smudge will have smarter thoughts about where to go from here.

Right! Thanks

I started looking into this on Friday, but I'll note that it's a little more complicated than simply adding the feature back. We removed it because it was incompatible with delayed's multithreading (where a single worker can claim & work off multiple jobs at once, configured via the max_claims option). Supporting raise_signal_exceptions in a way that would allow the individual job threads to rescue would require some extra signal-passing across threads that I haven't had a chance to explore yet.
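
To make the cross-thread problem concrete, here is a minimal Ruby sketch of one way an interruption could be forwarded to each job thread so that every job gets its own chance to rescue. It is not delayed's implementation, and nothing in it is part of delayed's or delayed_job's API: JobInterrupted, run_jobs_with_interrupt, and interrupt_all are hypothetical names invented for illustration. It relies only on Ruby's built-in Thread#raise and Queue, and it simulates the signal by calling interrupt_all directly instead of installing a real handler.

```ruby
# Hypothetical sketch; none of these names come from delayed or delayed_job.
class JobInterrupted < StandardError; end

# Starts `job_count` stand-in job threads, yields them to the caller once each
# one is inside its rescue-protected section, then collects what each reported.
def run_jobs_with_interrupt(job_count)
  started = Queue.new
  results = Queue.new

  threads = Array.new(job_count) do |id|
    Thread.new do
      begin
        started << id # tell the coordinator this job is now rescuable
        sleep         # stand-in for a long-running job body
      rescue JobInterrupted
        # A real worker might unlock the job here so it can be retried.
        results << [id, :released_for_retry]
      end
    end
  end

  job_count.times { started.pop } # wait until every job is inside its begin block
  yield threads
  threads.each(&:join)
  Array.new(job_count) { results.pop }.sort
end

# Forwards one interruption to every job thread. In a real worker this would
# have to be triggered from whatever code path handles the process signal.
def interrupt_all(threads, signal_name)
  threads.each { |thread| thread.raise(JobInterrupted, signal_name) }
end

outcome = run_jobs_with_interrupt(2) { |threads| interrupt_all(threads, "TERM") }
p outcome # => [[0, :released_for_retry], [1, :released_for_retry]]
```

The extra coordination is needed because Ruby runs signal handlers on the main thread, so with several claimed jobs in flight, something has to fan the interruption out to each job thread explicitly.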