Betterment / delayed

a multi-threaded, SQL-driven ActiveJob backend used at Betterment to process millions of background jobs per day

Locks not being cleared on SIGKILL

suwyn opened this issue

Our jobs remain locked after the worker process receives a SIGKILL (in our case from docker stop, after the grace period elapses).

I'm not sure if this is by design or not; the README states ("may" being the keyword):

the process and may result in long-running jobs remaining locked until Delayed::Worker.max_run_time has elapsed.

In my tests the jobs always remain locked when a job is running, and Delayed also doesn't terminate on a SIGTERM.

Here is a simple rake task I used to simulate the issue, based on Delayed:
class TestWorker
  def start
    trap('TERM') { quit! }
    trap('INT') { quit! }

    500.times do |i|
      puts "Run #{i}"
      sleep 1.second
    end
  ensure
    on_exit!
  end

  def quit!
    puts 'quit!'
  end

  def on_exit!
    puts 'on_exit!'
  end
end

namespace :test do
  desc "Test signal interupts"
  task work: :environment do
    TestWorker.new.start
  end
end

If you run that as a rake task, it won't terminate on a SIGTERM, only on a SIGKILL, which won't execute the ensure block (and in Delayed, that ensure block is what unlocks the jobs).

Whereas if I explicitly `exit` in the `quit!` method, it works as expected:
class TestWorker
  def start
    trap('TERM') { quit! }
    trap('INT') { quit! }

    500.times do |i|
      puts "Run #{i}"
      sleep 1.second
    end
  ensure
    on_exit!
  end

  def quit!
    puts 'quit!'
    exit
  end

  def on_exit!
    puts 'on_exit!'
  end
end

namespace :test do
  desc "Test signal interupts"
  task work: :environment do
    TestWorker.new.start
  end
end
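
In other words, the unlock pattern at play is roughly this (a simplified sketch, not Delayed's actual implementation; the method and column names are illustrative): the cleanup lives in an ensure block, which runs on a normal return, an explicit exit, or an exception, but never when the process is SIGKILLed.
def run_job(job)
  job.payload_object.perform # do the actual work
ensure
  # Skipped entirely on SIGKILL; the job then stays locked until
  # locked_at + Delayed::Worker.max_run_time has elapsed.
  job.update!(locked_at: nil, locked_by: nil)
end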

What's the recommended way to gracefully shut down a running job when the process is terminated? Should a job implement its own traps or is there a hook that Delayed offers?

  • Rails 7.0.7
  • Ruby 3.2.2
  • Delayed 0.5.0

Hi @suwyn,

Thanks for reaching out!

I've modified your example a bit to add a `break if stop?` check:
class TestWorker
  def start
    trap('TERM') { quit! }
    trap('INT') { quit! }

    loop do
      3.times do |i|
        puts "Run #{i}"
        sleep 1
      end
      break if stop?
    end
  ensure
    on_exit!
  end

  def stop?
    @stop
  end

  def quit!
    @stop = true
    puts 'quit!'
  end

  def on_exit!
    puts 'on_exit!'
  end
end

This more closely mirrors how Delayed responds to a SIGINT or SIGTERM: it attempts to finish the current set of jobs assigned to the thread pool before exiting. (I'm simulating that with 3 "jobs," repeated forever by the "worker" in a loop.)

Here's a sample output with a SIGINT (Ctrl+C) sent during the second set of jobs:
$ rake test:work
Run 0
Run 1
Run 2
Run 0
^Cquit!
Run 1
Run 2
on_exit!

It loops around to pick up more jobs until it receives a SIGINT/SIGTERM, at which point it will finish the current pool and exit cleanly instead of picking up more jobs. However, if it receives a SIGKILL instead, it will exit immediately, without attempting to clean anything up. (That signal is special and is handled by the kernel/OS; the worker never actually receives it, so there is no way for it to react.)

What's the recommended way to gracefully shut down a running job when the process is terminated?

Right, so to answer your question: the way to gracefully shut down is to first send a SIGTERM, give jobs a chance to finish, and then send a SIGKILL later if/when you need to fully stop the process. If you're not seeing SIGTERM produce a clean exit within a reasonable amount of time, it's likely due to long-running jobs. This means jobs need to be short-lived enough to complete gracefully within that waiting period; otherwise, the worker will exit ungracefully, and new workers will wait until locked_at + Delayed::Worker.max_run_time before picking those jobs up again, to be sure that no other worker is still running them.

In general, I'd suggest deconstructing jobs into shorter units of work, but keep in mind that once you send the SIGTERM, you know that the worker won't pick up any new jobs, so—depending on your deployment infrastructure—you could wait a very long time before sending a SIGKILL! (Perhaps even the entire max_run_time, at which point you'll know for sure that all jobs have either completed or timed-out-with-cleanup.)
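
As a rough illustration of the timing involved (a sketch assuming the Delayed::Job ActiveRecord model and the locked_at column mentioned above; this isn't an official API for inspecting stalled jobs):
# If a worker is SIGKILLed mid-job, the row keeps its lock. Other workers will
# only treat it as claimable again once max_run_time has passed since locked_at.
stalled = Delayed::Job.where.not(locked_at: nil).order(:locked_at).first
if stalled
  claimable_again_at = stalled.locked_at + Delayed::Worker.max_run_time
  puts "Job #{stalled.id} locked since #{stalled.locked_at}; " \
       "eligible for pickup again at #{claimable_again_at}"
end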

Should a job implement its own traps or is there a hook that Delayed offers?

I hadn't really considered this before. Generally I think we've found that keeping max_run_time configured to the default of 20 min (or less)—and waiting after SIGTERM for jobs to finish gracefully—has produced the best overall results (in addition to making sure that jobs are all idempotent & re-runnable). Even if a worker can't exit gracefully due to a long-running job, the longest we'd have to wait for that job to be picked up again is 20 minutes (and since this typically only affects jobs that take a long time to complete, we don't expect fast turnaround anyways).
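
That setting is the same Delayed::Worker.max_run_time quoted from the README above; a minimal sketch of pinning it in a Rails initializer (assuming the usual Delayed::Worker configuration style) looks like:
# config/initializers/delayed.rb
# Keep the lock timeout short so that a SIGKILLed worker's jobs become
# claimable again quickly; 20 minutes is the default mentioned above.
Delayed::Worker.max_run_time = 20.minutes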

Thanks for the explanation @smudge, it all makes sense.

You're correct that the culprit for us is a long-running job, and that it should be broken down into shorter units of work while keeping max_run_time sensible. We had been delaying that work and kept bumping max_run_time up to compensate 😨 and given that we're cleaning up containers 60 seconds after a deploy, those jobs are staying locked/stalled for up to 3 hours before they get cleared and picked up again.
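
For example, a hypothetical sketch of that decomposition (the job, model, and method names here are invented for illustration): each run handles one bounded batch and enqueues a follow-up job for the rest, so every individual job finishes comfortably within max_run_time.
class BackfillStatementsJob < ApplicationJob
  BATCH_SIZE = 500

  def perform(last_id = 0)
    # Process one bounded slice of the work per job execution.
    batch = Statement.where("id > ?", last_id).order(:id).limit(BATCH_SIZE).to_a
    batch.each { |statement| statement.rebuild! } # placeholder for the real work

    # Hand the remainder off to a fresh job instead of looping in-process,
    # so no single job runs anywhere near max_run_time.
    self.class.perform_later(batch.last.id) if batch.size == BATCH_SIZE
  end
end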

While we do have the option to trap the SIGTERM in the job itself, it would feel cleaner if that came from the Job API (e.g. perhaps rescue_from), but I find myself agreeing with you again here and want to avoid that complexity.

tl;dr - We'll decompose our long-running job so that it runs as multiple units, allowing us to keep max_run_time low enough that containers can be cleaned up in a timely manner. Thanks!