Automattic / kue

Kue is a priority job queue backed by redis, built for node.js.

Home Page: http://automattic.github.io/kue

Jobs stuck in inactive state

mikemoser opened this issue · comments

Jobs get stuck in the inactive state fairly often for us. We noticed that the length of q:[type]:jobs is zero, even when there are inactive jobs of that type, so when getJob calls blpop, there is nothing to process.

It looks like this list only gets pushed to when a job is saved and its state is set to inactive, via lpush q:[type]:jobs 1. We're wondering if this lpush is failing in some cases, and once the count is off, jobs remain unprocessed.
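A quick way to spot the mismatch for a given type looks something like this (plain node_redis; 'email' is just an example type name):

var redis = require('redis')
  , client = redis.createClient();

// if the token list is shorter than the inactive index, BLPOP will never wake up
client.llen('q:email:jobs', function (err, tokens) {
  client.zcard('q:jobs:email:inactive', function (err, inactive) {
    console.log('tokens: %d, inactive jobs: %d', tokens, inactive);
  });
});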

Has anyone else seen this issue?

Yes, I am having the same problem... do you have a patch?

We do not have a patch yet - as we're not sure of the root cause. Right now we are wondering if there is an intermittent problem with the lpush call below.

Job.prototype.state = function(state){
  var client = this.client;
  this.removeState();
  this._state = state;
  this.set('state', state);
  client.zadd('q:jobs', this._priority, this.id);
  client.zadd('q:jobs:' + state, this._priority, this.id);
  client.zadd('q:jobs:' + this.type + ':' + state, this._priority, this.id);
  //increase available jobs, used by Worker#getJob()
  if ('inactive' == state) client.lpush('q:' + this.type + ':jobs', 1);
  return this;
};

We have added some diagnostics in the job.js constructor to log errors from the redis client, and are waiting for a repro:

this.client.on('error', function (err) {
  console.log('redis job client error ' + err);
});

This may not be the cause; if anyone else has ideas or a patch, we would love to know. Jobs get stuck often for us, so I was surprised that more people have not run into this.

It's very strange... I am trying to find the cause...

When I call jobs.inactive() I get the job IDs: [ '147', '149', '144', '164', '168', '172', '176' ]

But for some reason jobs.process() doesn't see them or process them.

I am able to reproduce the problem when stopping a worker... tasks tacked by the worker will be stuck in the inactive state.

if I run:

jobs.process 'email', 4, (job, done) ->

1 task remains unfinished in the active state and 3 more remain stuck in the inactive state forever.

If I restart the worker, all the other pending tasks are processed, but the ones I mentioned stay stuck forever.

We have just seen this issue as well (job stuck in the inactive state).

@sebicas do you have more details on your repro? Not sure what it means to be "tacked" by a worker while still inactive?

@mikemoser not sure if "tacked" was the right word... what I tried to say is that for some reason the number of stuck tasks is somehow related to the number of simultaneous tasks specified for the job type.

For example if I do:

jobs.process 'email', 4, (job, done) ->

4 tasks will be stuck

jobs.process 'email', 6, (job, done) ->

6 tasks will be stuck

and so on...

We use kue to manage somewhere between 1k-20k jobs per day and see the same problems. Sometimes it's once a week, other times multiple times per day.

Unfortunately, the root cause of these issues is likely fundamental to the way kue is written - since changes are applied serially in kue, not as an atomic transaction, any little glitch or crash can cause the items in a job to get only partially applied, leading to the need to manually repair "broken" jobs.

We're at the stage where we're deciding whether to rewrite the innards of kue to be more reliable, or whether to move to something else. Any thoughts would be appreciated.

Unfortunately we are in the same situation as @dfoody :(

It should be pretty trivial to make things atomic. I don't have time to look at it right now, but at worst we could use a little Lua script. Though even if this portion is fully atomic, there's always the chance of something being stuck if the process is killed, etc. Really I think the bigger problem is that we need to recover from half-processed jobs.
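Something like this, roughly (untested sketch, not kue code; assumes redis >= 2.6 and node_redis's eval):

var redis = require('redis')
  , client = redis.createClient();

// KEYS[1] = q:<type>:jobs (BLPOP token list), KEYS[2] = q:jobs:<type>:inactive
// ARGV[1] = priority, ARGV[2] = job id
var enqueueScript =
    'redis.call("zadd", KEYS[2], ARGV[1], ARGV[2]) '
  + 'redis.call("lpush", KEYS[1], 1) '
  + 'return 1';

// add the inactive index entry and the wake-up token in one server-side step
function enqueueAtomically(type, priority, id, cb) {
  client.eval(enqueueScript, 2,
    'q:' + type + ':jobs',
    'q:jobs:' + type + ':inactive',
    priority, id, cb);
}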

I agree; besides making things atomic... if the process is killed in the middle of a job's execution, that causes the job to get stuck... @visionmedia any suggestions on how to solve that?

Off the top of my head I can't think of any way really to differentiate between an active job and an active job whose proc died. We could possibly tie PIDs into the whole thing, or alternatively just "timeout" those jobs: if a job has been active for N minutes and it's not complete, kill it and retry.
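Roughly something like this (illustrative only, not kue code; it assumes the q:job:<id> hash fields discussed in this thread, like type, priority and updated_at):

var redis = require('redis')
  , client = redis.createClient();

var MAX_ACTIVE_MS = 10 * 60 * 1000; // hypothetical threshold: 10 minutes

// re-queue jobs that have sat in the active set longer than MAX_ACTIVE_MS
function requeueStuckActiveJobs() {
  client.zrange('q:jobs:active', 0, -1, function (err, ids) {
    if (err || !ids) return;
    ids.forEach(function (id) {
      client.hmget('q:job:' + id, 'type', 'priority', 'updated_at', function (err, fields) {
        if (err || !fields || !fields[2]) return;
        var type = fields[0], priority = fields[1] || 0, updated = parseInt(fields[2], 10);
        if (Date.now() - updated < MAX_ACTIVE_MS) return;
        // mirror what Job#state('inactive') does, as quoted above
        client.zrem('q:jobs:active', id);
        client.zrem('q:jobs:' + type + ':active', id);
        client.zadd('q:jobs:inactive', priority, id);
        client.zadd('q:jobs:' + type + ':inactive', priority, id);
        client.hset('q:job:' + id, 'state', 'inactive');
        client.lpush('q:' + type + ':jobs', 1);
      });
    });
  });
}

setInterval(requeueStuckActiveJobs, 60 * 1000);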

FYI, there is a pull request here #105 (thanks @dfoody) that incorporates a watchdog to check/fail jobs that have run more than a configurable period of time. We have been using this code successfully in our project.

I do think they could be separated a bit. It's a pretty huge patch; I don't think some of that belongs in core, and it takes more time to review really big pull requests that have a larger scope.

@dfoody just mentioned he still has stuck jobs, so I guess his patch didn't solve the problem completely.

There are really two separate issues here:

(1) What do you do with jobs that legitimately fail (this is where the watchdog enhancement I put in does work well - as long as you're sure that, when the watchdog fires, it's really failed and not just slow - so set your timeouts appropriately). The only alternative to really know if jobs have died or not is to use something like zookeeper under the covers (which has a nice feature that locks can automatically be released when a process dies).

(2) What happens when kue's data structures get corrupted. This is happening to us a lot right now, due to a combination of factors we believe: We're now doing calls across Amazon availability zones (Redis in a different AZ from Kue servers - increasing the latency between a series of redis requests) and we're now running significantly more kue servers than we were before. We think it's this combination of factors causing us to see the corruptions much more often. This is where moving to atomic redis operations (with appropriate use of 'watch') will hopefully help.
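To make (2) concrete, the shape we have in mind for the inactive transition quoted earlier is roughly this (a sketch only, not what's actually in our fork):

var redis = require('redis')
  , client = redis.createClient();

// apply the index updates and the wake-up token together, instead of as
// independent round-trips that can be interrupted half way
function setInactiveAtomically(type, priority, id, cb) {
  client.multi()
    .zadd('q:jobs', priority, id)
    .zadd('q:jobs:inactive', priority, id)
    .zadd('q:jobs:' + type + ':inactive', priority, id)
    .hset('q:job:' + id, 'state', 'inactive')
    .lpush('q:' + type + ':jobs', 1)
    .exec(cb);
}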

@dfoody - thanks for clarifying. To be clear, this issue represents (2): we have a very basic setup and we see the redis indexes described above get out of sync before a job is ever processed, so jobs just stay in the inactive state. It happens a lot, however we cannot get a consistent repro. Does anyone have a repro?

@mikemoser given what you describe - does your process that queues the job quit soon after queuing it?
If so, that's almost certainly your issue. Not all redis operations have completed by the time kue typically invokes your callback, so you need to make sure the process doesn't die for at least a few seconds, to give kue/redis enough time to finish all the operations that are part of queuing a job.
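For example, in a short-lived script that just queues a job, something like this (illustrative; the job type and data are placeholders):

var kue = require('kue')
  , jobs = kue.createQueue();

jobs.create('email', { to: 'user@example.com' })
  .save(function (err) {
    if (err) console.error('save failed', err);
    // don't exit right away; give the remaining redis writes time to flush
    setTimeout(function () { process.exit(0); }, 5000);
  });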

We've also seen the same issue, and our queuing processes are long-lived ones. Any workarounds? Has this issue been clarified yet?

@dfoody our worker process is always running, so kue should have all the time it needs to finish the operations of adding a job (e.g. the call at line 447 of kue/lib/jobs.js that increments the index by one for the new job). We are not able to get a consistent repro, so it's proving hard to fix, however we see it happen all the time. I want to reiterate that this issue is about "new" jobs never getting out of the inactive state, not in-process jobs that get stuck. Those of you who said you've seen the same behavior: is it new jobs stuck in the inactive state and never getting processed?

I've not seen the case on our side where jobs get stuck in inactive without something else happening around the same time (e.g. a crash that corrupts the data, or AWS having "minor" issues like all of EBS going down). But, when a job does get stuck, we have to manually correct things before it starts moving again.

That said, we're running off my fork, not the original (which has lots of QoS and reliability changes).

One other thing to try: Have you restarted Redis recently? We have seen that sometimes redis does need a restart and that fixes some things.

We're seeing similar behavior as well.

What we see is that new jobs are stuck in an inactive state.

We have concurrency set to 1, but have a cluster of 4 processes.

Looking at Redis, we currently have two 'inactive' jobs.

When a new job is created, the oldest of the two inactive jobs suddenly gets processed.

So, we have, essentially, the two newest jobs always stuck - until they're displaced by new jobs.

There seem to be two causes for new jobs to never get processed and stuck in the inactive state.

  1. Indexes are out of sync (e.g. my original post)
  2. BLPOP not responding when new items are added

@edwardmsmith it sounds like your symptoms are related to #1. You can verify this by checking whether llen q:[type]:jobs is less than zcard q:jobs:[type]:inactive. We added a watchdog for each job type that checks whether they are out of sync and corrects it by calling lpush q:[type]:jobs 1 for however many jobs are inactive and not in the key used by the process job.

After correcting #1, we still noticed jobs stuck in the inactive state. It seems that BLPOP becomes unresponsive for certain job types and those jobs never process, even though the redis indexes look good. We don't have a high volume of jobs for these types, and our theory is that something goes wrong with the redis connection, but it fails silently and BLPOP just remains blocking and doesn't process any more jobs of that type. We have to restart our worker process and it starts processing all the jobs properly. Has anyone seen BLPOP exhibit this behavior?

We're considering switching to lpop and adding a setTimeout to throttle the loop, however we'd prefer to keep BLPOP and not add what is essentially a high-frequency polling solution.
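Our watchdog is roughly the following (simplified sketch; you'd plug in your own job types and interval):

var redis = require('redis')
  , client = redis.createClient();

// if the BLPOP token list is shorter than the inactive index for a type,
// push enough tokens to wake the worker up again
function resyncInactiveTokens(type) {
  client.llen('q:' + type + ':jobs', function (err, tokens) {
    if (err) return;
    client.zcard('q:jobs:' + type + ':inactive', function (err, inactive) {
      if (err) return;
      for (var i = 0; i < inactive - tokens; i++) {
        client.lpush('q:' + type + ':jobs', 1);
      }
    });
  });
}

setInterval(function () {
  ['email'].forEach(resyncInactiveTokens); // your job types here
}, 5000);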

This might help you.

Here's the rough set of steps we typically follow to repair various kue issues we see regularly:
(note that some of these apply only to our fork, which has the ability to do "locks" so that jobs for - in our case - a single user are serialized; users are differentiated based on their email address).

Failed Jobs Not Showing
When there are failed jobs (the sidebar says non-zero), but none show in the list, follow this procedure to repair them:

redis-cli

zrange q:jobs:failed 0 -1

For each, do hget q:job:NUM type until you find one whose 'type' is null (or where no 'type' field shows up).
Then hgetall q:job:NUM to see the data values for it.

If there is no 'data' json blob, you can't recover - just delete the job as follows:
hset q:job:NUM type bad
zrem q:jobs:QUEUE:failed NUM
(where QUEUE is specific queue the job was in - if you don't know which do this for each one)

That should make the jobs now appear.
Then go into the Kue UI and delete the 'bad' job.

If that doesn't work (e.g. it corrupts the failed queue again), here's how to manually delete a job:
zrem q:jobs NUM
zrem q:jobs:failed NUM
del q:job:NUM
del q:job:NUM:state
del q:job:NUM:log

Even if there is a 'data' json blob, other fields might be messed up. It's best to find out what type of job it is and who it applies to (via looking in the log files), do the above procedure and then kick off a new job (via the admin ui) to replace the corrupt one.

Jobs Staying in Queued
Sometimes, jobs will stay in queued and not be allocated to a worker even if one is available. But, as soon as another job is queued, one will go out of queued and get processed (but one or more will still be "stuck" in queued).

First, find the queue that's misbehaving.
The example below assumes QUEUE is its name (some of the literal commands below use a queue named mail).

Find out how many jobs are queued:
llen q:QUEUE:jobs
zrange q:jobs:QUEUE:inactive 0 -1

There are two possible problems here:
The number doesn't match between these two commands.
The number matches and it's 0 for both, but a job is still showing in the UI
To solve these problems:

  1. Execute the following command as many times as is needed to make the numbers the same (e.g. if llen returns 0 and zrange returns 2 items, run it 2 times):
    lpush q:mail:jobs 1

  2. In this case (they show up in the UI and when you do zrange q:jobs:inactive 0 -1), for each job showing up in the UI but not in the above commands, it could be that the job is actually in a different state, or the entries are invalid. Here's how to check:
    hget q:job:NUM state

 if the state is inactive, do the following in this order:
      zadd q:jobs:mail:inactive 0 NUM
      lpush q:mail:jobs 1

 If the state is not inactive, then you should remove it from the inactive list:
      zrem q:jobs:inactive NUM

Jobs Staying in Staged
If jobs for a user stay in staged, and there are no other jobs for that user in inactive, active, or failed, this likely means that a previous job never released the lock correctly. Check if this is the case as follows (given the specific user's email):
get q:lockowners:EMAIL

Assuming this shows a job number, get that job's current state:
hget q:job:NUM state

If its current state is complete, you just need to delete the job and that should get the queue flowing. You may also need to repair the staged queue if it's corrupt after deleting the job:
zrem q:jobs:staged NUM

If you can't get to the specific job, try clearing the completed queue.

If the current state of the job that has the lock is 'staged', then you should move that job directly to 'inactive' manually in the UI (since it already has the lock it can go ahead and be moved to execute).

@mikemoser - Thanks for the reply - interestingly, I don't have a key (my job type is 'email') q:email:jobs at all:

> keys *email*
1) "q:jobs:email:active"
2) "q:jobs:email:failed"
3) "q:jobs:email:complete"
4) "q:jobs:email:inactive"

So I had two stuck jobs:

> zrange q:jobs:email:inactive 0 -1
1) "6063"
2) "6064"
> lpush q:email:jobs 1
(integer) 1
> lpush q:email:jobs 1
(integer) 1
> zrange q:jobs:email:inactive 0 -1
(empty list or set)
> 

So, that seems to have cleared out the stuck items for now.

@dfoody - Wow, thanks for that!

@edwardmsmith looks like your key was empty and it does seem that the indexes were out of sync. You can add a watchdog for each type to check this and correct it like we have.

@dfoody thanks for sharing - looks like y'all are having a lot of issues. We hope this is not a sign of things to come for us as we push more volume through kue. You list only 2 reasons for "Jobs Staying in Queued", however we have seen a third, where the numbers match on the indexes and they are greater than zero. In that case we just see the worker for that type sitting on the BLPOP command even though we are pushing new jobs to the key it's blocking on (e.g. lpush q:[type]:jobs 1). It really seems like BLPOP is just not responding when it should and never throwing an error. I'm not very experienced with redis in a high-volume scenario - is BLPOP prone to this type of behavior? We are using OpenRedis to host, not sure if the extra network layers would affect this.

We've not seen issues with BLPOP.
But, we're on redis 2.4.x and it looks like OpenRedis is on 2.6.x (and, coincidentally, the latest 2.6 release has a fix for a BLPOP issue…)

We host redis ourselves, and it's entirely possible - if you're not local to your redis server - that that could be the cause of issues (though I've not looked at the underlying redis protocol layer to see how they implement it to know more concretely if that type of thing could be an issue - e.g. does it heartbeat the connection to detect failures, etc.).

We're thinking about changing the kue/lib/queue/worker.js getJob() function to no longer use BLPOP and just use LPOP with a setTimeout. Here is a change that we've been testing locally. Any thoughts?

/**
 * Attempt to fetch the next job. 
 *
 * @param {Function} fn
 * @api private
 */

Worker.prototype.getJob = function(fn){
  var self = this;

  // alloc a client for this job type
  var client = clients[self.type]
    || (clients[self.type] = redis.createClient());

  // BLPOP indicates we have a new inactive job to process
  // client.blpop('q:' + self.type + ':jobs', 0, function(err, result) {
  //   self.zpop('q:jobs:' + self.type + ':inactive', function(err, id){
  //     if (err) return fn(err);
  //     if (!id) return fn();
  //     Job.get(id, fn);
  //   });
  // });

  client.lpop('q:' + self.type + ':jobs', function(err, result) {
    setTimeout(function () {
      self.zpop('q:jobs:' + self.type + ':inactive', function(err, id){
        if (err) return fn(err);
        if (!id) return fn();
        Job.get(id, fn);
      });
    }, result ?  0 : self.interval);
  });
};

Any news on this? Looking to run > 200k jobs/day and need something stable, since it will be kind of impossible to handle errors/stuck jobs manually.

We have determined and fixed the cause of BLPOP not responding. There were a few factors in play:

  1. We use OpenRedis and they had some logic that would kill idle connections (not considering blocking operations).
  2. We use a database index greater than 0 (i.e. not the default on a connection), and the reconnect logic in kue did not consider the currently selected database.

So, the reason BLPOP appeared unresponsive was because it had reconnected to the wrong database (i.e. back to index 0). We fixed this by:

  1. OpenRedis made a change to stop killing idle connections for blocking commands.
  2. We added some logic in Kue to ensure the correct database was selected before calling BLPOP again.

kue/lib/queue/worker.js getJob()

/**
 * Attempt to fetch the next job. 
 *
 * @param {Function} fn
 * @api private
 */

Worker.prototype.getJob = function(fn){
  var self = this;

  // alloc a client for this job type
  var client = clients[self.type]
    || (clients[self.type] = redis.createClient());

  // FIX: Ensure the correct database is selected
  // Note: We added selected_db when originally connecting 
  client.select(client.selected_db, function () {
    // BLPOP indicates we have a new inactive job to process
    client.blpop('q:' + self.type + ':jobs', 0, function(err) {
      // FIX: BLPOP may return an error and further commands should not be attempted.
      if (err) 
        return fn(err);

      self.zpop('q:jobs:' + self.type + ':inactive', function(err, id){
        if (err) return fn(err);
        if (!id) return fn();
        Job.get(id, fn);
      });
    });
  });
};

This is not the best place for this logic; I'm assuming we'd want to make the change in the core reconnect logic and ensure it does not execute BLPOP until we're sure the database has been selected. However, we have had this fix in place for several weeks and things are looking much better for us.

We continue to have a watchdog to fix the indexes, however we're observing to see if that issue is related to the selected db issue on reconnect as well.

@mikemoser BLPOP has a timeout parameter - just set it to a nonzero value and create a long-polling loop rather than use setTimeout. http://redis.io/commands/blpop. How's the watchdog fix been working for you? We're just now running into this in our app and I'm debating whether to fight this bug or move to another library while we're small and have time.
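By "long-polling loop" I mean roughly this shape (a sketch, not kue's actual getJob):

var redis = require('redis')
  , client = redis.createClient();

// block for at most 5s; on timeout BLPOP returns null and we simply loop again
function waitForJobToken(type, fn) {
  client.blpop('q:' + type + ':jobs', 5, function (err, reply) {
    if (err) return fn(err);
    if (!reply) return waitForJobToken(type, fn);
    fn(); // a token arrived; the caller can now zpop the inactive index
  });
}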

@mfrobben we're back to using blpop as a long poll (see the latest code above). We're still keeping an eye on the watchdog; we have some diagnostics that output to the log when the index was off and needed correction, and we see that fire from time to time. However, we still see jobs not processing every once in a while and we're not sure why just yet. We are using a non-default database on redis and that was the reason BLPOP seemed not to be working correctly, and we are investigating whether that's why the indexes are sometimes out of whack as well. Our kue is working much better now, however we still can't sleep well at night knowing that some jobs could get stuck at any time.

Can you describe the problems you're seeing?

Same inactive jobs problem that the thread was first created with. A job will be listed as 'inactive' but the BLPOP either never fired or the server crashed, so the thread doesn't do anything until the next job comes in. Running lpush clears the queue, so I was thinking the watchdog makes enough sense as a solution.

We process shipments, email, push notifications, and payment transactions in background workers. Email and push are meh, but shipping and payment are ultra-critical codepaths, and having things invisibly 'stuck' creates unacceptable risk for our business IMO. I was under the impression that the persistence and replay functionality of redis guaranteed we had recoverability for jobs after a crash, but given this bug, kue seems to be causing more issues in those situations than it solves. :|

This is our 3rd worker queue implementation in node. The only one that has worked bug-free so far is atomic database collection polling.

FWIW, and i'm sad to say this, we are giving up on kue because of these problems and moving to a postgres-backed queue.

@spollack My gut tells me the same. Ahh, the heartburn of being bleeding edge :\

We're taking a different route at the moment.
We're gradually rewriting some of the kue internals to be transactional.
But, it's not far enough along yet to be sharable.

Is anyone else using a non-default db on redis with their kue? I ask because if you are still seeing indexes get out of sync and you're using the default zero database, that means we can likely rule that out as the cause. My current thinking is that they are not related, I'm just interested to get some extra data from the group.

@spollack @mfrobben Me too, I'm trying to process > a million jobs/day and unfortunately Kue is too unreliable.
I really do like Kue, and I hope in the future it will be more bulletproof.

@spollack you don't happen to have some solution to share?

@Jellyfrog i'm finishing the implementation now -- i'm happy to share as soon as i get a bit more testing on it.

Great!

@Jellyfrog i've created gists with the key pieces of our postgres-backed queue, so you can get a sense of the direction we went.

https://gist.github.com/spollack/5000509 -- db schema
https://gist.github.com/spollack/5000516 -- queue implementation
https://gist.github.com/spollack/5000522 -- queue tests

The code was written relying on a bunch of other pieces of our environment (including streamline, the gesundheit/any-db stack, etc.) and so likely will not be directly applicable without some adaptation.

It's still quite new, but so far it is looking good in testing, and we'll be rolling it into production soon. Feedback welcome on the design.

If there is sufficient interest, i can work to factor out some of the internal dependencies and set it up as a real project/package.

Looks promising!
At least you should add it as a repo on your profile; it feels bad to discuss it here!

@Jellyfrog good idea, i've set up a repo at https://github.com/spollack/pgqueue with the sample code (a bit cleaned up), and i've taken down the gists from before.

For anyone that's interested we've published our changes to make the internals of kue transactional (or at least as close as you can get with redis) to https://github.com/dfoody/kue/tree/QoS
A couple of notes on using it:

  • You need to use redis 2.6 (since it relies on lua scripting)
  • When you set up redis 2.6 we'd recommend that you set timeout to 0 and tcp-keepalive to 60

In our experience with this so far it's been significantly more reliable than previously. We've not gotten a single corrupt job since moving to this codebase (though we've only been on it about a week).

We thought that we had this problem, but in fact we did not understand the limits of Redis. I'm sorry that I cannot give a reference to the actual documentation that states this, but if you have any other communication with Redis on a connection that is used for pub/sub, you may run into problems. Try creating a separate connection for pub/sub that is isolated from any other requests to redis.
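In code that just means keeping two clients around, e.g. (names and channel are illustrative):

var redis = require('redis');

// one connection for normal commands (LPUSH, ZADD, ...)
var commandClient = redis.createClient();

// a dedicated connection used only for pub/sub; once it subscribes it
// must not be used for regular commands
var subscriberClient = redis.createClient();
subscriberClient.subscribe('q:events');
subscriberClient.on('message', function (channel, message) {
  console.log('event on %s: %s', channel, message);
});

commandClient.lpush('q:email:jobs', 1);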

We've been running with this approach for over a month now without a single problem.

Hope this message helps someone!
Cheers

Any further updates on this issue?

Any updates here? My indexes are out of sync for no reason. Why are they out of sync?

I also ran into this problem.

bump

Anybody got a fix for this? Jobs are randomly stuck in queue.

@felixchan Hi Felix, this is a fundamental issue in the basic kue architecture - the updates to redis are not transactional, so any failure during the middle of an update leaves a half applied change to redis (the source of stuck jobs).

My branch of kue https://github.com/dfoody/kue/tree/QoS has a rewrite of the core of kue to be fully transactional. We've been running it in production (~50k jobs per day for ~6 months) with <5 total stuck jobs over that period (we used to get ~1-5 per day).

The downside is that - since it's so different than the baseline kue - it hasn't been merged in a very long time.

@dfoody Which versions of Node have you run your QoS branch with? I'm entertaining the idea of putting together 'kue2', which should probably have some of the capabilities you've already developed.

@rosskukulinski We still run node 0.6.21 in production. That said, I don't think there will be an issue using it with a newer version of node, other than the Kue UI (since express/connect have changed a lot). So, you can probably take the back-end of Kue from our fork, and the UI from the new version of kue and they should fit together fairly easily.

@dfoody cool, sounds like something for me in the next week or two. I also wonder how many people use the Kue UI. I feel it should really be a separate package and I would probably split it out.

@dfoody @rosskukulinski Keep us updated on what happens. I'd like to stick with Kue.

I realized that after installing redis on a different server AND disabling snapshots/append-only file, everything works. No problems with this anymore.

Hi @webjay - After some soul searching, we are going to be moving to a 'real' job system based on AMQP. For us, we've fallen in love with RabbitMQ and will be moving over to that in production very soon. As such, we've dropped development on the Kue package. The major lack of leadership and response from the maintainer is frustrating and the poor code quality is simply too much for a small team like us to take on.

Kue was great as a platform to get us going (especially because we use Redis elsewhere), but as you can see from the PR list, the project is dead.

cc: @dfoody

We currently have no problems in production (a million jobs/day) using kue on redis 2.6. Some pull requests are merged by me in our fork (some bug fixes + some improvement pulls + no breaking API changes), and I'm gonna push it in a few days!
I'd like to keep the Kue project up and running. If anybody can help with writing test cases to automate and facilitate pull requests, we would have an active, responsive fork of Kue!

@rosskukulinski Being on top of Redis, with no need for a RabbitMQ-like broker, is the awesomeness of Kue! However, one could write an AMQP backend adapter for Kue!? But Kue is simple enough to be changed and modified, and it should also remain simple!

I wrote a new job manager that cares about atomicity while still being based on redis. The main feature missing is "priority", but for many this is not important: https://github.com/OptimalBits/bull

We were using mongodb anyway, so switched to https://github.com/scttnlsn/monq last month and haven't had a problem since. Not as full featured as kue, but seems to be pretty solid.

news? :)

@v4l3r10 have you tested the 0.7.x version?

@behrad this is still happening for us as of 0.7.5. The most recent item in the queue gets stuck as inactive and requires another job to get queued before it will run, with the new job taking over the role of being the stuck job... Restarting the node.js app doesn't force the job to run.

this is still happening for us as of 0.7.5

Can you describe a deterministic case or situation in which it happens for you, or share example code which reproduces it?

Have you also tested the QOS branch?

We have not tested the QOS branch.

Our usage is a simple email job created with jobs.create, and a worker with jobs.process -- the bug seemed to go away and then came back; it's hard to track down -- I've been watching this issue for months. I was using some of your earlier work on Kue (before your big PR got merged in), which seemed to resolve it, but the bug is back now.

My colleague and I have decided to remove Kue ASAP, considering that the fact that it has no unit tests really makes it a wildcard compared to the rest of our system, which is tested.

Have you changed your deployment? (redis version? is redis local or remote to kue? ...)
Is it related to your input traffic!?
Some see that this happens when Redis crashes hard - did you get any errors there when it happened!?
I am eager to resolve this issue. If you can help me find the point where it happens, we can put a workaround there to help stuck jobs out of the inactive state.
How often does it happen?

And about tests: I haven't had enough time to write a complete suite, but I've created an issue to write them. Anyone interested can help improve Kue.

@keyvanfatehi take a look at the Bull job manager, https://github.com/OptimalBits/bull - we have used it for months without any problems

@manast we are looking at bull later this week, thanks man, at first glance it looks great.

@behrad yes I would really like to find out as well. I will do my best to triage Kue a bit before I rip it out of our stack, and I'll update here if I find anything.

Same here guys. I'm testing my sent emails, but sometimes I get a stuck job.

PS: 0.7.5

I'm planning to refactor Kue's built-in zpop implementation to use redis's atomic brpoplpush.
I'll need your help testing it when it's finished.

@behrad that is not so easy because kue provides a priority queue, and brpoplpush is for standard lists...

I am thinking of using redis SORT ... STORE to build lists on the fly @manast

Another workaround on top of the current version is to write a simple fixing/monitor process for the inactive job indexes. It could poll every 5 seconds, figure out which jobs are stuck, and fix the indexes if necessary.

Until we can get to the atomic brpoplpush + sorted lists implementation, I've made minor changes to the sensitive core of worker#getJob and job.setState, which are available in the experimental branch. That code still doesn't contain any watchdogs to fix stuck jobs on the fly, so we can determine whether they still happen with this patch.

Would anybody interested please test it and let me know the results?

git clone -b experimental https://github.com/LearnBoost/kue.git

@brunocasado @keyvanfatehi @v4l3r10 @webjay @scriby ... anybody wanna test that branch!?

I'm seeing exactly @keyvanfatehi's problem.

@behrad do I need to run the experimental branch on all kue clients, or can I get away with doing it on just the ones creating jobs or processing jobs?

@ericz on both, however workers processing jobs are more important. BUT please don't run against the experimental branch, since I've found a bug in it today, which I fixed and will commit into the master branch as 0.7.6 by tomorrow. Please be ready to test in a day or two and let me know the results for 0.7.6 ;)

KK, will wait on 0.7.6

@ericz @keyvanfatehi I pushed a 0.7.6 pre-release into the develop branch. Please test via
git clone -b develop https://github.com/LearnBoost/kue.git

I'm using this workaround to expire my stuck jobs:

jobs.process(key, concurrency, handlerWithExpiration(handler));

function handlerWithExpiration (handler) {
  return function (job, done) {
    console.log("Job %s:%s created", job.id, job.type);

    // fail the job if the handler hasn't called done within 180s
    var expire = setTimeout(function () {
      done('Automatically Expired after 180s');
    }, 180000);

    handler(job, function (err) {
      clearTimeout(expire);
      done(err);
    });
  }
}

@ericsaboia are you still having stuck jobs with kue 0.7.7!?

@behrad I'll test again in a few days. Is the bug you found related to the stuck jobs?

EDIT: I'll try to do some tests today, maybe a simple stress test with setTimeout.
Very sorry for the late answer.

FYI, the QoS branch has a built-in "watchdog" capability that can auto-restart jobs that are stuck.

Of course, the QoS branch isn't really subject to kue causing stuck jobs (since the core was rebuilt for atomicity and consistency versus the kue master) - but the watchdog itself can still help if your jobs themselves have internal issues that cause them to get stuck.

We are talking about jobs being stuck in inactive state, not active jobs. It seems you are talking about active jobs!?

I did a patch containing some small fixes to job state changes and the worker job poll. Those may reduce the probability of jobs being stuck in inactive, which is reported to happen with unstable remote redis deployments/hostings.
That will help until we change to brpoplpush + some server-side lua in later versions.

@ericsaboia and about the active jobs: this happens when your worker process encounters unhandled exceptions and exits abnormally. These, as @dfoody said, can be resolved with a simple watchdog process and removed, or we can add a job TTL implementation to Kue.

@behrad I know that the issue is about the inactive state, but sometimes, when my jobs get stuck in an active state, all other inactive jobs of the same type get stuck as well due to the concurrency limit. When that happens, the inactive jobs are still not processed even if I remove the active jobs that had been stuck.

@behrad It would be awesome to see the TTL implementation in Kue! I have hundreds of thousands of jobs being processed every day, and a dozen developers creating new handlers all the time, so it is almost impossible to handle all exceptions and call the done callback.

@ericsaboia This problem is in your app design. You should give them a convention or boilerplate code for developing robust workers. You can use Node.js domains, uncaughtException events, or the error event on EventEmitter to properly handle errors in Node.js.
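For example, with domains it could look something like this (illustrative sketch; sendEmail is a placeholder handler):

var kue = require('kue')
  , domain = require('domain')
  , jobs = kue.createQueue();

jobs.process('email', function (job, done) {
  var d = domain.create();
  d.on('error', function (err) {
    // an unhandled exception inside the handler: fail the job instead of crashing
    done(err);
  });
  d.run(function () {
    sendEmail(job.data, done); // placeholder async handler
  });
});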

To cure jobs stuck without calling "done" -- if you are using promises (or bluebird, at least -- I haven't checked if this is Promise spec compliant), you can wrap like:

jobs.constructor.prototype.processJob = (name, proc)->
  jobs.process(name, (job, done)->
    Promise.method(proc)(job).nodeify(done)
  )

Any progress on this issue? I am using kue 0.8.1 and for me a few of the jobs just get stuck in the active state.

@ankitpatial

  1. This issue relates to jobs stuck in inactive mode, not active
  2. Your worker's code is responsible for proper error handling to ensure done is called; otherwise jobs will stay in the active state
  3. If you think your problem is something else please create a new issue providing details.

@behrad thanks for the quick reply. I see a few tasks just freeze in the active queue; there is no exception in the app log, all seems fine. It doesn't seem to be my code's issue, but I will give it a try and debug my code for any possible hidden crash.

Ensure your done call is not swallowed by any errors or by your logical code path. If you find something, please create a new issue.

@shaunc that (promises) does not help with handling errors from the client's async operations.

Just a quick question. Could this be related to an underlying network problem? This may be completely unrelated, but we've recently discovered that EC2 discards some TCP traffic due to a Xen issue.

@Climax777 Yes, your redis connection may be interrupted in the middle of anything...

@behrad Take a look at my blog post concerning the dropped TCP packets; it may help some deployments with these issues.

I am using kue@0.8.3.
My job maker and worker are in the same node process.
The job maker generates jobs in rand(0, 10) seconds, but I find that sometimes, if the delay time is too short (such as less than 1 sec), the worker is not invoked immediately. From the UI tool I can see that the stuck jobs are in the Queued column, in the inactive state. When I restart my node process all the stuck jobs get processed at once.

By using the following method I can work around this problem:

  1. q.promote(1000); //check every second
  2. change my job delay time to rand(2, 10)
After this, everything seems to be ok.

But still, I found another problem:
if I start my kue process and then restart redis-server, any newly generated job will be stuck until I restart my kue process.

@ganziganzi promote should have nothing to do with jobs being stuck in the inactive state. Your workers' active jobs may be stuck! Please check if you have stuck active jobs first; they won't allow other inactive jobs to reach the workers. I should first understand what exactly your problem is, so please provide a code snippet which produces the results you described, so that I can debug it.