Betterment / delayed

A multi-threaded, SQL-driven ActiveJob backend used at Betterment to process millions of background jobs per day.

Concurrent job not running

AxelTheGerman opened this issue

Hi there, I've been running delayed as my job backend for a while now and just started noticing that some concurrent jobs don't seem to run.

I only have 1 worker, so this might be expected, but https://github.com/Betterment/delayed#running-a-worker-process says:

By default, a worker process will pick up 2 jobs at a time (ordered by priority) and run each in a separate thread.

(That seems to be 5 by default now)
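
For reference, the claim size looks tunable from an initializer; a minimal sketch, assuming the Delayed::Worker.max_claims setting described in the README (worth verifying against the version you have installed):

    # config/initializers/delayed.rb
    # Assumed setting: max_claims is the number of jobs a worker claims
    # per pickup (reportedly 5 by default in current versions).
    Delayed::Worker.max_claims = 5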

My jobs are very long-running, and I have:

  • Job A 6am-8am (running for 2h)
  • Job B 7am-8am (running for 1h)

Job B will never actually run - I think technically it does, but it just no-ops, since it is already past its finish time.

So, for my understanding: each worker will run multiple jobs in parallel, but it does not check for additional jobs while any job from its current batch is still running? Even for less long-running scheduled tasks - say, 100 one-minute jobs sprinkled in with some five-minute jobs - we would sit idle every once in a while until a five-minute job completes?

Hi @AxelTheGerman!

So, yes, you've hit on something that I think is expected behavior, which is that when a worker picks up a set of jobs, it will wait until the last job in that batch completes before it attempts to pick up any additional jobs. As a result, if you have long-running jobs and need higher worker availability, you must run multiple workers at once.
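
For example, you could run several worker processes side by side; a minimal Procfile sketch, assuming the rake delayed:work entry point from the README:

    # Procfile: each line is an independent worker process. Each process
    # claims its own batch of jobs, so a long-running job only occupies
    # a thread in the process that claimed it.
    worker_1: bundle exec rake delayed:work
    worker_2: bundle exec rake delayed:work
    worker_3: bundle exec rake delayed:work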

A recommendation for improving queue health in general is to break long-running jobs down into shorter-running jobs (delayed is optimized for handling a huge volume of relatively short jobs, not a low volume of long-running jobs), but if that's not possible, then in addition to adding workers, you could consider having dedicated queues for particular scheduled tasks. It's also worth double-checking your Delayed::Worker.max_run_time to make sure that jobs aren't timing out.
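
As a rough sketch of the dedicated-queue idea (the recordings queue name is invented for illustration, and both settings are worth double-checking against your installed version):

    # app/jobs/recording_job.rb
    class RecordingJob < ApplicationJob
      queue_as :recordings  # hypothetical dedicated queue for long work
    end

    # config/initializers/delayed.rb
    # Raise the per-job timeout so a two-hour job isn't treated as hung.
    Delayed::Worker.max_run_time = 3.hours

You'd then pin one worker to that queue (delayed inherits a queue filter from delayed_job; see the README for the exact invocation) and let the other workers serve everything else.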

Hi @smudge, thank you for the in-depth answer... just wondering if that's worth documenting somewhere - but maybe it's enough of an edge case that this issue serves as documentation.

Most job frameworks seem to be optimized for a high volume of shorter jobs - I guess it's the easier problem to solve, and (in most cases) you can write your code accordingly.

For now, delayed still works well enough for me, though I'll keep my eyes open for a tool actually built for longer-running jobs (making sure they didn't die, surviving re-deploys, etc.).

Great tip on having multiple queues and workers per queue... it would be a shame to have some important smaller jobs wait behind hour-long ones :P

Hi Axel, a common perspective in modern infrastructure and service design is to keep units of work small so that they can be resumed and retried quickly in the event of an infra failure or network partition. Delayed is designed for short jobs because of that quality, rather than because it's easier per se (though I admit it's also easier). It sort of mirrors the invention and subsequent adoption of RAID arrays in place of expensive, highly reliable disk drives.

We have techniques for breaking down big batches of work into small jobs that reliably complete in aggregate which I think we may look into open sourcing in the future.
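
As a generic sketch of the fan-out idea (nothing delayed-specific here; the model, job names, and sweep! method are invented for illustration):

    # A short dispatcher job enqueues one small job per record instead
    # of processing the whole dataset inside a single long-running job.
    class NightlySweepJob < ApplicationJob
      def perform
        Account.find_each do |account|
          AccountSweepJob.perform_later(account.id)
        end
      end
    end

    class AccountSweepJob < ApplicationJob
      # Small, independently retryable unit of work.
      def perform(account_id)
        Account.find(account_id).sweep!  # hypothetical domain method
      end
    end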

Yes, that totally makes sense, and it works in most cases as well - not always easy, but it almost always works.

Unfortunately, since I'm doing live recordings, the job has to run for as long as the recording does. It's a kind of background job, but it doesn't fit the most common definition of one - it's closer to a daemon or supervisor process... but it's scheduled as well 😅