Betterment / delayed

a multi-threaded, SQL-driven ActiveJob backend used at Betterment to process millions of background jobs per day

Migrating from delayed_job: replacement for Delayed::Command

valerauko opened this issue

I saw that Delayed::Command has been removed from this gem in favor of running through rake.

Could you provide an example of how to achieve the same (or the replacement) behavior?

I saw that the QUEUE env var is used to specify the queue used by the workers. Is MAX_CLAIMS your way of specifying concurrency? delayed_job would use OS-level processes -- delayed uses threads instead, is that correct?

Good question! Yes, for now, MAX_CLAIMS is the way to get concurrency within a single worker process -- this allows it to make more efficient use of the job pickup query (given N threads, it will run only 1 batch pickup query per worker loop, instead of running N queries across N process forks).
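
Concretely, starting a worker looks something like the sketch below (treat the task and env var names as illustrative and check them against the README for your version; MAX_CLAIMS is as discussed above):

```sh
# Start one worker process; thread-level concurrency within it comes from MAX_CLAIMS.
# Queue names and the claim count are illustrative values only.
QUEUES=default,mailers MAX_CLAIMS=5 bundle exec rake delayed:work
```

Each invocation like this replaces one delayed_job worker process; running several of them is the horizontal, process-level scaling described below.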

We then scale our worker processes horizontally using Kubernetes pods -- our thought was that many platforms offer an equivalent form of scaling (e.g. Heroku dynos, etc). You are free to daemonize these workers however you wish and run multiple processes, but the mechanisms for performing that mode of scaling are no longer controlled by this gem.
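
On Heroku, for example, that could be as simple as a Procfile with one entry per worker type, each scaled independently -- a hypothetical sketch, with queue names and claim counts as placeholders:

```
# Each entry is a separately scalable worker process (e.g. `heroku ps:scale worker=3`).
worker: QUEUES=default MAX_CLAIMS=5 bundle exec rake delayed:work
worker_mailers: QUEUES=mailers MAX_CLAIMS=2 bundle exec rake delayed:work
```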

I see. With delayed_job, when a worker process died due to some unexpected error (like an error message containing bytes that couldn't be saved in the text field), the process just died quietly and another could be spawned. Is there some reference for how this gem behaves when a worker thread encounters an unexpected condition? Does it take down the other threads in the same process too?

Good questions! I will attempt to provide answers below, but I'm also very open to any thoughts or suggestions you might have, especially if there's something that we could better document!

Does it take down the other threads in the same process too?

A crashed thread should not take down other threads. Threads are 1:1 with job executions, and those are wrapped with a catch-all rescue on this line. (We also rely on something like this plugin to get alerts on our issue tracker.)
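
In other words, the shape of it is something like this -- a hand-written sketch, not the gem's actual code, with made-up method names purely for illustration:

```ruby
# Hypothetical sketch of the per-job catch-all described above. The rescue sits
# inside the unit of work each thread runs, so a failing job is recorded and the
# thread moves on. (The gem's real rescue is broader than StandardError.)
def run(job)
  job.call
rescue StandardError => error
  warn "job failed, recording error and scheduling retry: #{error.class}: #{error.message}"
end

run(-> { raise "payload blew up" })        # reported, crashes nothing
run(-> { puts "the next job still runs" })
```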

Furthermore, our threading model relies on concurrent-ruby's FixedThreadPool, and its current behavior is to essentially swallow exceptions. So if errors ever did happen in the job-error-handling code itself (after that catch-all rescue), they would be swallowed without taking down the worker, a new thread would be spawned for a subsequent job, and we would rely on the continuous monitoring features to tell us if there are jobs that are essentially stuck in a locked state, etc. (We have a bunch of alerts in place around max job age and max lock age.)
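
If you want to see that swallowing behavior for yourself, here's a tiny standalone script (not from the gem; it only needs the concurrent-ruby gem installed):

```ruby
# An exception that escapes a task posted to a FixedThreadPool is swallowed by
# the pool rather than crashing the process, and later tasks still get served.
require "concurrent"

pool = Concurrent::FixedThreadPool.new(2)

pool.post { raise "error escaping the job-error-handling code" } # silently swallowed
pool.post { puts "the pool keeps accepting and running work" }

pool.shutdown
pool.wait_for_termination
puts "worker process is still alive"
```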

The concurrency implementation is subject to change, of course, but we would generally aim to maintain the existing error isolation and thread pool recovery behaviors.

[...] the process just died quietly and another could be spawned. Is there some reference for how this gem behaves when a worker thread encounters an unexpected condition?

As a general rule, we aim to prevent the worker process itself from crashing -- like, it shouldn't happen, and if it ever does, we'd consider it a high priority bug. Once the worker loop is running, most errors would be contained within threads, so really what we're talking about would be very exceptional framework bugs (or resource contention/limits) that cause the worker process itself to crash.

In those cases, this gem doesn't attempt to restart its own processes. The folks at Heroku have a good write-up on this approach ("The Process Model"), and all it means is that some outside process manager (k8s, heroku, systemd, etc) should be in charge of handling process crashes. It should be the job of the process manager to restart failed processes and notify us (or log the occurrence) accordingly. (For example, with Kubernetes, we let crashed pods restart automatically, with the standard CrashLoopBackOff backoff applied between restarts. And Heroku handles dyno crashes with its own exponential backoff policy.)
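
For instance, under systemd that supervision could look roughly like the unit below -- purely illustrative, with the paths, environment, and rake task name as placeholders to adapt:

```ini
# /etc/systemd/system/delayed-worker.service (illustrative sketch)
[Unit]
Description=delayed worker
After=network.target

[Service]
WorkingDirectory=/srv/my-app
Environment=RAILS_ENV=production
ExecStart=/usr/local/bin/bundle exec rake delayed:work
# Restart the worker if it ever crashes, with a short backoff between attempts.
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```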

Hopefully that answered your questions! And if you have more questions or encounter any issues, please let us know! 😄

Thanks! I'll give the gem a try and will be back if I run into trouble!