Scheduled jobs fail (as of Sidekiq 2.12.1)

Question

deckchairhq opened this issue 11 years ago · comments

This was originally issue #11. The information contained within that pull is inaccurate and misleading, so I'm starting a new issue.

Scheduled jobs cause an error with the sidekiq-middleware. This is potentially related to a recent change in Sidekiq (sidekiq/sidekiq@c7828f1) whereby scheduled and retried jobs now call client middleware.

Deckchair · Answer 1 · Fri Jun 07 2013 23:28:01 GMT+0800 (China Standard Time)

More details about this issue....

The following error:

Jun  7 14:44:11 ip-10-166-4-155 sidekiq: 2013-06-07T14:44:11Z 4320 TID-ov9e1g10o ERROR: undefined method `get_sidekiq_options' for "Deckchair::Jobs::ImportHttp":String
Jun  7 14:44:11 ip-10-166-4-155 sidekiq: 2013-06-07T14:44:11Z 4320 TID-ov9e1g10o ERROR: /root/.rbenv/versions/2.0.0-p0/lib/ruby/gems/2.0.0/bundler/gems/sidekiq-unique-jobs-c10a44a73f8c/lib/sidekiq-unique-jobs/middleware/client/unique_jobs.rb:9:in `call'

...is seen using both this gem and sidekiq-unique-jobs: https://github.com/form26/sidekiq-unique-jobs

I am scheduling jobs within a running job, and the 'unique job client middleware' receives a string containing the class name instead of the actual class reference. This breaks when get_sidekiq_options is called, because String does not define that method.

I'm going to hazard a guess at the cause:

In version 2.5.4 of Sidekiq @karlfreeman added the following feature:

Sidekiq::Client.push now accepts the worker class as a string so the Sidekiq client does not have to load your worker classes at all. [#524]

https://github.com/mperham/sidekiq/blob/master/Changes.md#254
sidekiq/sidekiq@d9fd031

In version 2.12.1 of Sidekiq @dimko added the following feature:

Scheduled and Retry jobs now use Sidekiq::Client to push jobs onto the queue, so they use client middleware. [dimko, #948]

https://github.com/mperham/sidekiq/blob/master/Changes.md#2121
sidekiq/sidekiq@c7828f1

I assume this latest change has exposed the middleware to class names as strings, something that Sidekiq itself has catered for since version 2.5.4.

I'm wondering whether we should constantize if we receive a String or whether Sidekiq core should constantize before middleware is invoked (thus not breaking all the middleware libraries)?
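A minimal sketch of the first option, constantizing inside the middleware when a String arrives. The helper name resolve_worker_class is hypothetical and belongs to neither Sidekiq nor sidekiq-unique-jobs, and the worker class is a stand-in defined only so the example runs without Sidekiq loaded:

```ruby
# Hypothetical helper: accepts a class or a class name and returns the class.
def resolve_worker_class(worker_class)
  return worker_class unless worker_class.is_a?(String)

  # Object.const_get resolves namespaced names such as "A::B" on Ruby 2.0+.
  Object.const_get(worker_class)
end

# Stand-in worker (assumption), mimicking the one named in the error log.
module Deckchair
  module Jobs
    class ImportHttp
      def self.get_sidekiq_options
        { "queue" => "default", "unique" => true }
      end
    end
  end
end

# Both the string form and the class form now reach get_sidekiq_options.
from_string = resolve_worker_class("Deckchair::Jobs::ImportHttp").get_sidekiq_options
from_class  = resolve_worker_class(Deckchair::Jobs::ImportHttp).get_sidekiq_options
```

A middleware could call such a helper on its worker_class argument before reading the options. Note that this requires the worker class to be loadable in the process doing the push, which is the constraint the 2.5.4 change was meant to relax.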

I've not traced the whole journey yet due to time constraints, so if anyone else can at least add concrete confirmation that this is the cause of the issue that would be grand.

Felix Holmgren · Answer 2 · Mon Jun 10 2013 15:53:03 GMT+0800 (China Standard Time)

I'm seeing this issue as well and need to resolve it ASAP. Does anyone have a PR in the works? If not, I might jump in and try to submit one.

Josh Ellithorpe · Answer 3 · Wed Jun 26 2013 03:57:49 GMT+0800 (China Standard Time)

This fix alone is not enough. If you are using manual lock expiry, scheduled jobs don't work. The uniqueness key is added when you first schedule the item; then, when Sidekiq moves it from the scheduled queue to the processing queue, the uniqueness check stops it from running.

Anyone have a workaround? For async jobs everything is fine, but anything touching the scheduled queue is failing miserably.

Dmitry Krasnoukhov · Answer 4 · Wed Jun 26 2013 05:25:33 GMT+0800 (China Standard Time)

@zquestz, could you please provide a failing test for this?

Dmitry Krasnoukhov · Answer 5 · Wed Jun 26 2013 05:30:48 GMT+0800 (China Standard Time)

@zquestz BTW I'm running a bunch of Sidekiqs in production that are constantly processing a large (5M+) scheduled jobs queue, and I've noticed that uniqueness might be failing at the step where a job moves from the schedule to the regular queue. I've been very busy this past month, so unfortunately I haven't created a proper PR for Sidekiq to fix this.

Josh Ellithorpe · Answer 6 · Wed Jun 26 2013 05:35:54 GMT+0800 (China Standard Time)

That is exactly the issue I am seeing. I will try to provide you some specs =)

Dmitry Krasnoukhov · Answer 7 · Wed Jun 26 2013 05:39:12 GMT+0800 (China Standard Time)

The point is that I don't think this should be handled in this middleware. It seems more like a Sidekiq bug.
You might want to check how I fixed this for us just to make things work: theoldreader/sidekiq@5a6a2d9

Josh Ellithorpe · Answer 8 · Wed Jun 26 2013 06:42:01 GMT+0800 (China Standard Time)

Took a stab at a test. It passes when uniqueness options are turned off, and fails when set to :all. Let me know if there is a better way to set up and test this behavior.

50e723c

Darcy Laycock · Answer 9 · Thu Sep 05 2013 15:39:13 GMT+0800 (China Standard Time)

Yeh, we've been getting the same thing in ours.

I'm working on the latest version, and the issue we've seen can be tied down to a basic thing: The current version of the sidekiq scheduler uses this code:

# Get the next item in the queue if its score (time to execute) is <= now.
# We need to go through the list one at a time to reduce the risk of something
# going wrong between the time jobs are popped from the scheduled queue and when
# they are pushed onto a work queue and losing the jobs.
while message = conn.zrangebyscore(sorted_set, '-inf', now, :limit => [0, 1]).first do

  # Pop item off the queue and add it to the work queue. If the job can't be popped from
  # the queue, it's because another process already popped it so we can move on to the
  # next one.
  if conn.zrem(sorted_set, message)
    Sidekiq::Client.push(Sidekiq.load_json(message))
    logger.debug { "enqueued #{sorted_set}: #{message}" }
  end
end

The issue becomes this:

When we've scheduled the item, we've set a lock for the hash without the at key.
Sidekiq adds scheduled items to the queue.
Every so often, sidekiq will iterate items in the past, one by one, requeueing them using the client.
The uniqueness check is implemented as middleware on the client. Hence, we've already set the lock when we scheduled the item, and we check it again when moving the job from the scheduled sorted set to the normal queue.
Since it's already got a lock, it will just drop the message instead, never actually appending it to the queue.

Thus, perform_at is basically saying "delay adding this to the queue until this future point in time" - Note that it doesn't say "Add to the queue and process at this point in time".
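The sequence above can be reproduced with a small simulation. This is only a sketch under stated assumptions: FakeUniqueClient is a hypothetical in-memory stand-in for Redis, the Sidekiq client and the uniqueness middleware combined, and none of its names come from Sidekiq or sidekiq-unique-jobs.

```ruby
require "json"
require "set"

# Hypothetical in-memory stand-in: a lock set, a scheduled list and a work queue.
class FakeUniqueClient
  attr_reader :locks, :scheduled, :queue

  def initialize
    @locks = Set.new
    @scheduled = []
    @queue = []
  end

  # Mimics a client push guarded by a uniqueness middleware.
  # The lock key is built from the payload without the "at" key.
  # Returns false when the lock already exists and the job is dropped.
  def push(item)
    payload = item.reject { |key, _| key == "at" }
    return false unless @locks.add?(JSON.generate(payload.sort.to_h))

    item.key?("at") ? @scheduled << payload : @queue << payload
    true
  end

  # Mimics the scheduler: every due job goes through the same push again.
  def enqueue_due_jobs
    @scheduled.shift(@scheduled.size).map { |payload| push(payload) }
  end
end

client = FakeUniqueClient.new
client.push({ "class" => "ImportHttp", "args" => [1], "at" => Time.now.to_f + 60 })
results = client.enqueue_due_jobs # the lock taken at schedule time blocks the re-push
```

After this runs, results holds a single false and client.queue is still empty: the job left the scheduled list but never reached the work queue.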

I'm working on a pull request to address this, which involves adding a unique lock identifier and passing it into the actual job payload. Instead of simply checking whether the key is set, the middleware will drop the job only when the key is set and its stored identifier does not match the one carried by the job.
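A rough sketch of that idea, not the actual pull request. TokenUniqueClient and the "unique_token" payload field are hypothetical names, and a plain Hash stands in for Redis:

```ruby
require "json"
require "securerandom"

# Hypothetical sketch: each lock stores the token of the job that owns it.
class TokenUniqueClient
  attr_reader :locks, :scheduled, :queue

  def initialize
    @locks = {}
    @scheduled = []
    @queue = []
  end

  def push(item)
    payload = item.reject { |key, _| key == "at" }
    # The token (assumed field name) is excluded from the uniqueness key.
    key = JSON.generate(payload.reject { |k, _| k == "unique_token" }.sort.to_h)
    owner = @locks[key]

    # Drop only when a lock exists and belongs to a different job.
    return false if owner && owner != payload["unique_token"]

    payload = payload.merge("unique_token" => owner || SecureRandom.hex(8))
    @locks[key] = payload["unique_token"]
    item.key?("at") ? @scheduled << payload : @queue << payload
    true
  end

  # Mimics the scheduler: due jobs are pushed again, carrying their token.
  def enqueue_due_jobs
    @scheduled.shift(@scheduled.size).map { |payload| push(payload) }
  end
end
```

With this check, a second push of the same arguments is still rejected while the lock is held, but the scheduler's re-push carries the matching token and reaches the work queue. Releasing the lock once the job has run is left out of the sketch.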

Dmitry Krasnoukhov · Answer 10 · Mon Sep 09 2013 00:29:38 GMT+0800 (China Standard Time)

@zquestz I think the point is that you need to manage locks manually to be able to reschedule a job from inside the same job. See the README for an example.