Duplicate change events after confirm timeout and restart

Question

mjq opened this issue 10 years ago

Here's the relevant code block to follow along.

In Feed.prototype.confirm, a request is made to check if the DB is reachable, and a timeout is set to detect a slow response from Couch. If the timeout is hit, the Feed is killed (self.die is called). But, the request object isn't destroyed. That means that if Couch responds after the timeout, the happy path callback db_response still gets called.

Normally, this isn't that noticeable, since the Feed object is dead and everything short-circuits. But if the user called restart on the feed in response to the error, dead will be false, and the Feed ends up getting set up twice (once in response to the timed-out request, and once due to restart()). This results in every change event being emitted twice.

The fix would seem to be adding destroy_req(req); in the timeout handler, before dying. I haven't figured out how to write a test for this though. Any ideas?

Jarrett Cruger · Answer 1 · Fri Oct 24 2014 06:53:17 GMT+0800 (China Standard Time)

@mjq do you have any sample code that reproduces this? That's the best place to start for a test.

Matt Quinn · Answer 2 · Fri Oct 24 2014 07:19:47 GMT+0800 (China Standard Time)

Sorry, sure. Simplified, it's:

var follow = require('follow');
var db = '...';

var feed = new follow.Feed({db: db, include_docs: true});

feed.on('change', function(change) {
  console.log('got change %d', change.seq);
});

feed.on('error', function(err) {
  console.log('got error %s, restarting in 5s', err.message);
  setTimeout(function() {
    console.log('restarting');
    feed.restart();
  }, 5000);
});

feed.start();

Normally, the logs would look like

got change 5
got change 6
got change 7

But if the first attempt to reach the database times out and the database responds shortly after, you'll see

got error "Timeout confirming database: <db name>", restarting in 5s
restarting
got change 5
got change 5
got change 6
got change 6
got change 7
got change 7

Jarrett Cruger · Answer 3 · Sat Oct 25 2014 05:04:36 GMT+0800 (China Standard Time)

@mjq this is fascinating, I've never seen this happen. destroy_req should be called by the die function, but it seems like there is a race condition leaving two requests? I'll have to dig deeper on this when I have a minute.

Matt Quinn · Answer 4 · Sat Oct 25 2014 05:21:08 GMT+0800 (China Standard Time)

@jcrugzz die destroys self.pending.request, but the request in confirm is a local variable, so if it isn't destroyed in confirm, nothing will (or so it seems to me).

A simpler bug to test, repro and fix may just be:

1. The request in confirm takes longer than the timeout, but
2. db_response is called anyway (even though the timeout killed the feed).

Since db_response only applies to the success case, that alone is weird/wrong behaviour, and fixing just that (e.g. by destroying the request in the timeout fn) should prevent the double-listener stuff.

re: race conditions: We've got a single process simultaneously following an ever-changing set of a few thousand databases (with all those databases on the same CouchDB box). So, when requests to that box start stalling... well, if there's a race condition to be found, we'll find it, heh.

I'm giving this patch a trial by fire right now, but I don't know how long it will take for us to trigger the bug again.

Jarrett Cruger · Answer 5 · Sun Oct 26 2014 23:00:08 GMT+0800 (China Standard Time)

@mjq gotcha, this is before it is piped into the changes-stream. Let me know if you can reproduce it, but that looks like a valid fix. It's a super edge case, but I can see the potential for it happening.

Sergey Belov · Answer 6 · Thu Jun 04 2015 00:01:40 GMT+0800 (China Standard Time)

@mjq @jcrugzz Are you going to fix this?

carrotalan · Answer 7 · Mon Feb 19 2018 03:36:13 GMT+0800 (China Standard Time)

+1 - This is still an issue