Latent build slaves shut down uncleanly and get forgotten by the master

bb-bot opened this issue

This ticket is a migrated Trac ticket 1780

People contributed to the original ticket: @exarkun (commenter), @djmitche (commenter), extremoburo@... (commenter), @tomwardill (watcher), @jacobian (commenter, reporter), john.carr@... (watcher), exarkun (watcher)


Occasionally, when the master shuts down a latent buildslave, the shutdown fails in an odd way: the master decides that the latent buildslave is broken and never tries to boot it again.

Unfortunately I don't have a lot of insight into what's actually happening, but I'll provide as much detail as I can:

The buildmaster is http://buildbot.djangoproject.com/. All the code running there lives at https://github.com/jacobian/django-buildmaster, and you can see the specific latent buildslave implementation at https://github.com/jacobian/django-buildmaster/blob/master/djangobotcfg/rsc_slave.py.

Here's what I see in the logs when this error occurs:

2011-01-26 10:22:41-0800 [-] disconnecting old slave bs1.jacobian.org now
2011-01-26 10:22:41-0800 [-] waiting for slave to finish disconnecting
2011-01-26 10:22:41-0800 [-] DjangoCloudserversBuildSlave bs1.jacobian.org deleting instance 572258
2011-01-26 10:22:41-0800 [Broker,2,204.232.209.196] BuildSlave.detached(bs1.jacobian.org)
2011-01-26 10:22:45-0800 [Broker,3,204.232.209.196] slave 'bs1.jacobian.org' attaching from IPv4Address(TCP, '204.232.209.196', 53732)
2011-01-26 10:22:45-0800 [Broker,3,204.232.209.196] Slave bs1.jacobian.org received connection while not trying to substantiate.  Disconnecting.
2011-01-26 10:22:45-0800 [Broker,3,204.232.209.196] waiting for slave to finish disconnecting
2011-01-26 10:22:45-0800 [Broker,3,204.232.209.196] Peer will receive following PB traceback:
2011-01-26 10:22:45-0800 [Broker,3,204.232.209.196] Unhandled Error
        Traceback (most recent call last):
        Failure: exceptions.RuntimeError: Slave bs1.jacobian.org received connection while not trying to substantiate.  Disconnecting.
        
2011-01-26 10:22:45-0800 [-] DjangoCloudserversBuildSlave bs1.jacobian.org deleted instance 572258

The lines "DjangoCloudserversBuildSlave bs1.jacobian.org deleting instance 572258" and "DjangoCloudserversBuildSlave bs1.jacobian.org deleted instance 572258" are coming from my code; the rest are logged by Buildbot itself.

The connection error itself isn't the problem, since the slave gets shut down just a few seconds later anyway. But when this happens, the master decides the slave is somehow broken and never boots another instance. The only way to get it working again is to restart the buildmaster.

That's all I know for sure, but here's my speculation about what might be happening. It appears that the buildmaster disconnects my latent slave, calls stop_instance() to shut it down, and then detaches the slave. If the shutdown doesn't finish quickly enough, though, it looks like the slave tries to reconnect -- it's been kicked off by the master but not yet killed as part of the shutdown process. The master then seems to freak out, decide that the slave is misbehaving, and never try to boot it again.

It seems that the master should just ignore connections from the slave while it's trying to unsubstantiate that slave. Otherwise, unless the slave shuts down immediately upon the stop_instance() call, it seems like this will happen again and again.
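
To make the suspected race and the proposed behaviour concrete, here is a toy model in plain Python. It is only a sketch of the speculation above and is not Buildbot code: `ToyLatentSlave`, its `State` values (including `BROKEN`), and the `ignore_while_stopping` flag are all hypothetical names invented for illustration, and the real master may track this state quite differently.

```python
import enum


class State(enum.Enum):
    IDLE = "idle"                      # no instance running
    SUBSTANTIATING = "substantiating"  # master is booting an instance
    ATTACHED = "attached"              # slave is connected and usable
    STOPPING = "stopping"              # stop_instance() called, not finished
    BROKEN = "broken"                  # assumed: master has given up


class ToyLatentSlave:
    """Hypothetical stand-in for the master's view of one latent slave."""

    def __init__(self, ignore_while_stopping):
        self.ignore_while_stopping = ignore_while_stopping
        self.state = State.IDLE

    def substantiate(self):
        # A slave the master considers broken is never booted again.
        if self.state is State.BROKEN:
            return False
        self.state = State.SUBSTANTIATING
        return True

    def attached(self):
        # Called whenever the slave process connects to the master.
        if self.state is State.SUBSTANTIATING:
            self.state = State.ATTACHED
            return "accepted"
        if self.state is State.STOPPING and self.ignore_while_stopping:
            # Proposed behaviour: drop the connection, keep the state.
            return "ignored"
        # Suspected current behaviour: treat the connection as an error.
        self.state = State.BROKEN
        return "rejected"

    def begin_stop(self):
        self.state = State.STOPPING

    def finish_stop(self):
        if self.state is State.STOPPING:
            self.state = State.IDLE


def run_scenario(ignore_while_stopping):
    """Replay the sequence from the log with or without the proposed fix."""
    slave = ToyLatentSlave(ignore_while_stopping)
    slave.substantiate()
    slave.attached()               # normal boot
    slave.begin_stop()             # master disconnects, stop_instance() starts
    reconnect = slave.attached()   # slave reconnects before it is killed
    slave.finish_stop()            # instance is finally deleted
    return reconnect, slave.substantiate()


if __name__ == "__main__":
    print(run_scenario(ignore_while_stopping=False))
    print(run_scenario(ignore_while_stopping=True))
```

In this model, a reconnect that arrives during shutdown leaves the slave unbootable unless it is ignored, which matches what I'm seeing: only a buildmaster restart clears the state.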