Betterment / delayed

a multi-threaded, SQL-driven ActiveJob backend used at Betterment to process millions of background jobs per day

Handling duplicate jobs

synth opened this issue · comments

Hello! Thanks for taking up the torch and improving DelayedJob with all the goodies. I'm particularly grateful that this supports Ruby 3 kwargs, whereas DelayedJob said they wouldn't for some reason.

We are in the midst of migrating from DJ to Delayed and have found a sticking point with respect to our handling of "Duplicate Jobs". Duplicate jobs are separate jobs created by different parts of the system that do the same thing. For instance, let's say I have an expensive calculation that updates any time a certain frequent activity occurs in our app. So, different actions can result in the same job being created. While the job itself is idempotent, I don't want to fill my job queue with "duplicates" of the same job.

A long time ago, I created a gist to handle this, and it was recently turned into a gem by some other devs, which validates the need for duplicate checking.

The way this gem works is that it extends the ActiveRecord model with a "signature" column which can be indexed. It will try to infer a signature for a given job, or the signature can be explicitly defined with custom logic for each individual job. The signature is then utilized by a "standard" DJ plugin.
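Roughly, the signature piece looks like this (a from-memory sketch rather than the gem's exact code; the MD5-of-handler default and the locked_at scoping are illustrative):

require 'digest'

module SignatureConcern
  extend ActiveSupport::Concern

  included do
    before_validation :add_signature, on: :create
    validate :prevent_duplicate, on: :create
  end

  private

  def add_signature
    # Default inference: hash the serialized handler. A job can override this
    # with custom logic when the handler alone isn't specific enough.
    self.signature ||= Digest::MD5.hexdigest(handler.to_s)
  end

  def prevent_duplicate
    if self.class.where(signature: signature, locked_at: nil).exists?
      errors.add(:base, 'A duplicate job is already enqueued')
    end
  end
end

The concern gets included into Delayed::Job, so every enqueue computes a signature and refuses to save when an identical pending job already exists.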

Now that we are migrating to the Delayed gem, I'd like to build support for this functionality into Delayed. The issue I found is that the Delayed gem seems to require the Delayed::Job class (the ActiveRecord model) only when Rails::Engine is not defined:

delayed/lib/delayed.rb

Lines 18 to 23 in 66125d1

if defined?(Rails::Engine)
require 'delayed/engine'
else
require 'active_record'
require_relative '../app/models/delayed/job'
end

This prevents the gem from hooking into the ActiveRecord model without some fancy require logic. Is there a reason for this? Or is there another way to hook into the ActiveRecord class to extend functionality?

I see the above code was created via this commit, which says that Delayed::Job should autoload. However, it's not available when the duplicate-checking gem's code runs. I can see that the load path doesn't include /app/models at the time that gem loads, but it does by the time my Rails console loads. So I'm wondering if/how it's possible to get the load path updated earlier, before that gem loads.

Thanks!

Never mind! I realized this can be sorted by having the sub-gem hook into Railties! Sorry for the bother. If anyone runs into a similar issue, you can see how I handled it here: noesya/delayed_job_prevent_duplicate@f867c74

Thanks for reaching out, and no worries! I'm glad you were able to get the plugin to hook into Rails' load order. The need for after_initialize stems from the switch to the Zeitwerk autoloader, which disallows loading anything in the app/ folder during initialization; that's specifically what makes the Delayed::Job.include(...) call hard to inject.
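For anyone else reading, the Railtie-based workaround is roughly this shape (a sketch; the module and concern names are placeholders for whatever your gem defines):

module DelayedDuplicatePreventionPlugin
  class Railtie < Rails::Railtie
    # Delayed::Job lives in app/models, so under Zeitwerk it can't be
    # referenced while initializers run; defer until the app has booted.
    config.after_initialize do
      Delayed::Job.include(DelayedDuplicatePreventionPlugin::SignatureConcern)
    end
  end
end

(In a reloading development environment, config.to_prepare may be the safer hook, since an include done once at boot can be lost if Delayed::Job is reloaded.)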

One thing that might simplify your workaround would be to use Zeitwerk's on_load to defer the include call to when the class is available:

Rails.autoloaders.main.on_load('Delayed::Job') do
  Delayed::Job.include(DelayedDuplicatePreventionPlugin::SignatureConcern)
end

Of course, this only works if it doesn't need to be compatible with apps running Rails' Classic autoloader, so you'd need to do something to detect what Rails.autoloaders.main returns.
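Something like this is what I have in mind (a sketch; Rails.autoloaders.zeitwerk_enabled? is available on Rails 6+, and on_load needs a reasonably recent Zeitwerk):

if Rails.respond_to?(:autoloaders) && Rails.autoloaders.zeitwerk_enabled?
  # Zeitwerk: defer the include until Delayed::Job is actually loaded.
  Rails.autoloaders.main.on_load('Delayed::Job') do
    Delayed::Job.include(DelayedDuplicatePreventionPlugin::SignatureConcern)
  end
else
  # Classic autoloader: fall back to a to_prepare hook.
  Rails.application.config.to_prepare do
    Delayed::Job.include(DelayedDuplicatePreventionPlugin::SignatureConcern)
  end
end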

Alternatively, you could try doing everything in a before_enqueue hook in the plugin you've defined:

class DelayedDuplicatePreventionPlugin < Delayed::Plugin
  callbacks do |lifecycle|
    lifecycle.before(:enqueue) do |job|
      raise JobAlreadyEnqueued if identical_job_already_enqueued?(job)
    end
  end
end

I don't think there is any way to block an enqueue other than raising, so you'd need to change your enqueue logic to handle the possibility of such an exception.
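Something along these lines at each call site, for example (SomeExpensiveJob is just a stand-in, and JobAlreadyEnqueued is whatever error class the plugin raises):

begin
  SomeExpensiveJob.perform_later(record.id)
rescue DelayedDuplicatePreventionPlugin::JobAlreadyEnqueued
  # An identical job is already waiting in the queue; nothing to do.
end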

FWIW, we take a very different approach to solving the "duplicate enqueued jobs" problem, and rely entirely on idempotency during execution (something that has to be uniquely defined per job / business operation). This ensures that if two identical jobs run, the second will either no-op or fail. (It also avoids possible race conditions in which two jobs with matching signatures both pass this valid? check at the same time, before one or the other has had a chance to enqueue.)
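As a contrived sketch of what I mean (names invented for the example):

class SyncAccountBalanceJob < ApplicationJob
  def perform(account_id)
    account = Account.find(account_id)

    # Idempotency is defined by the business operation itself: if the balance
    # has already been synced past the latest transaction, a second identical
    # job simply no-ops.
    return if account.balance_synced_at && account.balance_synced_at >= account.last_transaction_at

    account.sync_balance!
  end
end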

I haven't directly compared the before-enqueue vs after-pickup strategies in terms of performance, but I suspect that you'd find that the after-pickup strategy results in fewer overall queries against the DJ table. It also ensures that the queue still works as expected if a developer really does intend for a type of job to be run multiple times against the same entity. (Just my two cents. 😄)

Thanks for the feedback and guidance! I will review.

For us, de-duplicating jobs isn't so much about idempotency. We are de-duping jobs that are already idempotent. For instance, consider a job that does an expensive calculation. No matter how many times that job is run, given the same context, it will produce the same output, and thus it is idempotent. If we didn't de-dupe, this expensive calculation could be run many times and clog our queues. Of course, you might say, "well, then no-op". But the logic to determine whether a no-op should occur is either expensive at runtime or requires complex state management.

An example here is a report calculation. In a contrived example, consider a business social network where you get points for lots of different activities (commenting, creating posts, reacting, etc). There is a report interface which shows aggregate counts of points by different attributes like department, job title, team, manager, etc. This is a complex query and assembly of information that happens regularly any time a new activity occurs.

This calculation is idempotent yet expensive to run. If we didn't de-dupe, our queue would fill with the same job every time there was new activity, when really it only needs to run once until the next new activity. Even if we could no-op, that would still leave a lot of jobs that do nothing, and to know when to no-op we'd have to build a system that marks the report as "stale", which seems like extra machinery when we can simply rely on a database index on a "signature" column to handle the dupe checking and race conditions for us.
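For concreteness, the database-level guarantee I'm referring to is just a unique index on the signature column, roughly like this (a sketch; the partial where: condition assumes Postgres, and column names match a stock delayed_jobs table):

class AddSignatureToDelayedJobs < ActiveRecord::Migration[7.0]
  def change
    add_column :delayed_jobs, :signature, :string
    # The database rejects a second identical pending job even if two
    # processes race to enqueue at the same moment.
    add_index :delayed_jobs, :signature,
              unique: true,
              where: 'locked_at IS NULL AND failed_at IS NULL',
              name: 'index_delayed_jobs_on_signature_pending'
  end
end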

Excellent context, thank you. It reminds me of the idea of "singleton" jobs.