Betterment / delayed

a multi-threaded, SQL-driven ActiveJob backend used at Betterment to process millions of background jobs per day

Configuring Exception Notification on Failure

tomrossi7 opened this issue

I'm trying to configure basic exception notification, but can't seem to get the event to trigger. It triggers if I subscribe to "delayed.job.error", just not "delayed.job.failure". I've confirmed from the logs that the job actually fails. Does anyone see anything I may be doing wrong here?

ActiveSupport::Notifications.subscribe("delayed.job.failure") do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  job = event.job
  ExceptionNotifier.notify_exception(job.last_error, data: { handler: job.handler })
end

I'm definitely seeing the event trigger, but if the callback block raises its own exception (due to a code error) I suspect that the worker might swallow the error and prevent you from seeing what's happening.

I tried your callback block in some test code, and when I threw a binding.irb in there I found that event.job raises an undefined method error -- I think the way to get at the job is event.payload[:job].
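For illustration, here is a minimal sketch of the original subscriber with that one change applied. The `notify_failure` helper and its `notifier:` argument are hypothetical names introduced here (they are not part of the gem) so that the callback logic can be run on its own; the rescue clause is likewise an assumption about how you might surface mistakes in the callback rather than rely on the worker to report them.

```ruby
# Hypothetical helper (not part of the delayed gem): builds the notification
# from a "delayed.job.failure" payload. `notifier` is any callable that takes
# (error, data: {...}), so the logic can be exercised without a worker.
def notify_failure(payload, notifier:)
  job = payload[:job] # the job lives in the payload, not on the event itself
  notifier.call(job.last_error, data: { handler: job.handler })
rescue StandardError => e
  # Report errors raised by the callback itself instead of losing them.
  warn "delayed.job.failure subscriber raised: #{e.class}: #{e.message}"
  nil
end

# Only wire up the subscription when both libraries are loaded.
if defined?(ActiveSupport::Notifications) && defined?(ExceptionNotifier)
  ActiveSupport::Notifications.subscribe("delayed.job.failure") do |*args|
    event = ActiveSupport::Notifications::Event.new(*args)
    notify_failure(event.payload, notifier: ExceptionNotifier.method(:notify_exception))
  end
end
```

Keeping the logic in a plain method means a typo like `event.job` shows up as soon as the helper is called from a console or a test, rather than only inside a running worker.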

It's worth noting, however, that the delayed.job.failure callback was really intended more for instrumentation/monitoring. (We forward all of these events along to our StatsD emitter.) For exception notifications, you might actually get better information by listening for delayed.job.run and checking for exception payloads:

ActiveSupport::Notifications.subscribe("delayed.job.run") do |*args|
  payload = ActiveSupport::Notifications::Event.new(*args).payload

  if payload[:exception_object]
    ExceptionNotifier.notify_exception(payload[:exception_object], data: { handler: payload[:job].handler })
  end
end

Thanks @smudge! What is the advantage of listening to delayed.job.run vs delayed.job.failure? Aren't they both for instrumentation/monitoring? We may just tweak it to send us an email when a delayed job fails so we know an issue has occurred.

Good question! So, delayed.job.run is the one that actually wraps the execution of the job and will bubble out real Exception objects (via payload[:exception_object]), as well as code timings, etc. It's more or less equivalent to the :invoke_job hook (in the pre-existing DJ lifecycle/plugin framework), which, for example, Sentry uses in its own DelayedJob plugin.

The delayed.job.failure hook, on the other hand, does not give you access to a real Exception instance -- all you get is job.last_error, which is a String -- and as such may not play as nicely with libraries that expect a true Exception instance. This is because it doesn't wrap any code execution, and was added more as a convenience for wiring up simple "increment" metrics.
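If you do stay on delayed.job.failure, one possible workaround is to wrap the String in an exception before handing it to a library that expects one. The sketch below is only an illustration: `error_from_last_error` is a hypothetical helper, and it assumes that the first line of `last_error` is the message and that any remaining lines are backtrace entries, which may not match how the String is actually formatted. The original exception class is not recoverable this way, so everything becomes a `RuntimeError`.

```ruby
# Hypothetical helper: turn a last_error String into an exception object.
# Assumption: first line is the message, remaining lines are backtrace entries.
def error_from_last_error(last_error)
  message, *backtrace = last_error.to_s.lines(chomp: true)
  error = RuntimeError.new(message || "unknown job failure")
  error.set_backtrace(backtrace) unless backtrace.empty?
  error
end
```

Inside a subscriber this could be used as `error_from_last_error(event.payload[:job].last_error)`, with the result passed on in place of the raw String.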

I'm taking all of this as feedback BTW, because maybe we can do something to either make the usage more clear, or forward the Exception instance along to the failure hook, so that it can be used more meaningfully outside of metrics.

Thanks for all you're doing!