How to properly handle interrupts
jgrau opened this issue · comments
Hi @bensheldon
I am using Sentry Cron Monitoring and have something that roughly translates to
around_perform do |_job, block|
# slug and monitor configs are irrelevant for this example but they exist :)
check_in_id = Sentry.capture_check_in(slug, :in_progress, monitor_config:)
start = Sentry.utc_now.to_i
begin
block.call
duration = Sentry.utc_now.to_i - start
Sentry.capture_check_in(
slug,
:ok,
check_in_id:,
duration:,
monitor_config:,
)
rescue StandardError
duration = Sentry.utc_now.to_i - start
Sentry.capture_check_in(
slug,
:error,
check_in_id:,
duration:,
monitor_config:,
)
raise
end
end
in application_job.rb
.
I have a problem when a job is interrupted while performing. The job has successfully "checked in" but since it is interrupted it never gets the chance to to update the checkin with either :ok
or :error
. Meanwhile, the second execution of the job (the automatic retry) performs another checkin and probably also a correct :ok
update but as far as Sentry knows, the original execution never finishes and receives a "timeout".
I figured I could solve that with something along the lines of
# Use this to control interrupts
# https://github.com/bensheldon/good_job#Interrupts-graceful-shutdown-and-SIGKILL
include GoodJob::ActiveJobExtensions::InterruptErrors
# Retry the job if it is interrupted
retry_on GoodJob::InterruptError, wait: 0, attempts: :unlimited
around_perform do |_job, block|
# slug and monitor configs are irrelevant for this example but they exist :)
check_in_id = Sentry.capture_check_in(slug, :in_progress, monitor_config:)
start = Sentry.utc_now.to_i
begin
block.call
duration = Sentry.utc_now.to_i - start
Sentry.capture_check_in(
slug,
:ok,
check_in_id:,
duration:,
monitor_config:,
)
rescue GoodJob::InterruptError
# If a job is interrupted by GoodJob it will be automatically retried
# and as it retries it will check in with sentry again, however, without
# this rescue block the cron checkin of the original job would never
# receive a status update and thus be marked as exceeded the max runtime
# (duration error).
duration = Sentry.utc_now.to_i - start
Sentry.capture_check_in(slug, :ok, check_in_id:, duration:)
raise
rescue StandardError
duration = Sentry.utc_now.to_i - start
Sentry.capture_check_in(
slug,
:error,
check_in_id:,
duration:,
monitor_config:,
)
raise
end
end
But as far as I can tell, I don't get to rescue the interrupt error.
Can you tell if my proposed solution "should work" (I might have made a stupid mistake somewhere) or whether I am going down a wrong path...
hmmm. I just doublechecked and it seems like the retry_on
handler will be called after the perform callbacks:
I'm not sure, though, in what order Active Job callbacks run. Could you try defining your around_perform
before you include GoodJob::ActiveJobExtensions::InterruptErrors
. That itself is an around_perform
:
good_job/lib/good_job/active_job_extensions/interrupt_errors.rb
Lines 8 to 12 in fbd1a5b
Aha, found it:
Calls the before and around callbacks in the order they were set
So if I'm understanding this correctly because these are wrapping, you'll want your own custom around_perform
wrapper to be invoked before you include GoodJob::ActiveJobExtensions::InterruptErrors
.
If that works, that might be an nice note to add to the Readme.
I think the reordering of callbacks does allow me to rescue the interrupt error but it also seems as if it's the second execution that is raising the exception (that I can now rescue). Does that make sense? I think I need to be able to rescue (run something) when the first execution is shutting down... 🤔
Oh! Yes, the InterruptError is only invoked on the 2nd execution because the first execution was interrupted (SIGKILL).
You can't call code during a SIGKILL. The best you can do is put your code in an ensure
.
Unless I'm misunderstanding the circumstances under which your job is being interrupted.
Thank you! I think I have something working:
around_perform do |_job, block|
# slug and monitor configs are irrelevant for this example but they exist :)
check_in_id = Sentry.capture_check_in(slug, :in_progress, monitor_config:)
start = Sentry.utc_now.to_i
result = :ok
begin
block.call
rescue StandardError
result = :error
raise
ensure
duration = Sentry.utc_now.to_i - start
Sentry.capture_check_in(
slug,
result,
check_in_id:,
duration:,
monitor_config:,
)
end
end
This does require setting a shutdown_timeout
that is lower than the grace period or timeout of the system that runs the ruby process - eg. I believe kubernetes will have a default grace period of 30 seconds.
Rails.application.configure do
...
# Wait 20 seconds for jobs to finish before shutting down. The kubernetes grace
# period is 30 seconds so forcing a shutdown after 20 seconds will allow good_job
# to handle the shutdown somewhat gracefully.
config.good_job.shutdown_timeout = 20
end
I also think my attempts to reproduce production-behaviour locally was flawed: I was using foreman
which has a default timeout of 5 seconds before sending SIGKILL
so until I set config.good_job.shutdown_timeout = 1
my ensure
block would not run.
Anyways - I believe this is working. Thanks for a great gem and your quick support :)