The buildbot server no longer starts new builds
vstinner opened this issue · comments
$ date
Fri Aug 25 09:51:55 UTC 2023
$ grep 'starting build ' master/twistd.log
# empty
$ grep 'starting build ' master/twistd.log.1|tail
2023-08-24 16:38:47+0000 [-] starting build <Build AMD64 Fedora Stable Clang Installed 3.x number:4304 results:success>.. pinging the worker <WorkerForBuilder builder='AMD64 Fedora Stable Clang Installed 3.x' worker='cstratak-fedora-stable-x86_64' state=BUILDING>
2023-08-24 16:38:47+0000 [-] starting build <Build AMD64 Fedora Stable LTO + PGO 3.x number:4551 results:success>.. pinging the worker <WorkerForBuilder builder='AMD64 Fedora Stable LTO + PGO 3.x' worker='cstratak-fedora-stable-x86_64' state=BUILDING>
2023-08-24 16:45:35+0000 [-] starting build <Build s390x Fedora Rawhide 3.x number:None results:success> using worker <WorkerForBuilder builder='s390x Fedora Rawhide 3.x' worker='edelsohn-fedora-rawhide-z' state=AVAILABLE>
2023-08-24 16:45:35+0000 [-] starting build <Build s390x Fedora Rawhide Clang 3.x number:None results:success> using worker <WorkerForBuilder builder='s390x Fedora Rawhide Clang 3.x' worker='edelsohn-fedora-rawhide-z' state=AVAILABLE>
2023-08-24 16:45:36+0000 [-] starting build <Build s390x Fedora Rawhide Clang Installed 3.x number:None results:success> using worker <WorkerForBuilder builder='s390x Fedora Rawhide Clang Installed 3.x' worker='edelsohn-fedora-rawhide-z' state=AVAILABLE>
2023-08-24 16:45:36+0000 [-] starting build <Build s390x Fedora Rawhide 3.x number:3548 results:success>.. pinging the worker <WorkerForBuilder builder='s390x Fedora Rawhide 3.x' worker='edelsohn-fedora-rawhide-z' state=BUILDING>
2023-08-24 16:45:36+0000 [-] starting build <Build s390x Fedora Rawhide LTO + PGO 3.x number:None results:success> using worker <WorkerForBuilder builder='s390x Fedora Rawhide LTO + PGO 3.x' worker='edelsohn-fedora-rawhide-z' state=AVAILABLE>
2023-08-24 16:45:36+0000 [-] starting build <Build s390x Fedora Rawhide Clang 3.x number:3538 results:success>.. pinging the worker <WorkerForBuilder builder='s390x Fedora Rawhide Clang 3.x' worker='edelsohn-fedora-rawhide-z' state=BUILDING>
2023-08-24 16:45:37+0000 [-] starting build <Build s390x Fedora Rawhide Clang Installed 3.x number:3692 results:success>.. pinging the worker <WorkerForBuilder builder='s390x Fedora Rawhide Clang Installed 3.x' worker='edelsohn-fedora-rawhide-z' state=BUILDING>
2023-08-24 16:45:37+0000 [-] starting build <Build s390x Fedora Rawhide LTO + PGO 3.x number:3627 results:success>.. pinging the worker <WorkerForBuilder builder='s390x Fedora Rawhide LTO + PGO 3.x' worker='edelsohn-fedora-rawhide-z' state=BUILDING>
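The manual grep above could be turned into a small check for monitoring. This is only a minimal sketch, assuming the twistd.log timestamp format shown in the excerpt; the function name last_build_start is hypothetical and not part of Buildbot.

```python
from datetime import datetime


def last_build_start(lines):
    """Return the timestamp of the newest 'starting build' line, or None."""
    latest = None
    for line in lines:
        if "starting build " not in line:
            continue
        # Lines look like: "2023-08-24 16:45:37+0000 [-] starting build ..."
        stamp = line.split(" [", 1)[0]
        try:
            when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S%z")
        except ValueError:
            continue  # not a timestamped log line, skip it
        if latest is None or when > latest:
            latest = when
    return latest
```

Comparing the result with the current time gives the age of the last build start. An age of many hours on a busy master, as seen here, would suggest that scheduling is stuck.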
Yesterday, a config change was made:
commit 6fed7ed896f67054c89428f0ef048f9f3a6b0098 (HEAD -> main, origin/main, origin/HEAD)
Author: Zachary Ware <zach@python.org>
Date: Thu Aug 24 11:38:26 2023 -0500
Add issue template for worker addition (#383)
I suppose that the server was restarted to take this new config into account.
I tried to restart the server to fix the issue:
buildbot@buildbot:/srv/buildbot$ make stop-master
(...)
2023-08-25 09:54:04+0000 [-] Ignoring SIGTERM, master is already shutting down.
I had to kill the server :-(
buildbot@buildbot:/srv/buildbot$ ps ax|grep buildbot
1015538 pts/0 S 0:00 sudo -u buildbot -s -H
1016928 pts/0 S+ 0:00 grep buildbot
3484334 ? Sl 894:05 /srv/buildbot/venv/bin/python3.9 -c from twisted.scripts import twistd; twistd.run() --no_save --logfile=twistd.log --python=buildbot.tac
buildbot@buildbot:/srv/buildbot$ kill -9 3484334
buildbot@buildbot:/srv/buildbot$ ps ax|grep buildbot
1015538 pts/0 S 0:00 sudo -u buildbot -s -H
1017004 pts/0 S+ 0:00 grep buildbot
I restarted the server:
$ make start-master
I see at least one issue: make stop-master is usually unable to stop the server. It seems that yesterday, deploying a new config triggered make stop-master. The server got SIGINT and stopped starting new builds, but it didn't actually stop either, and so it was not restarted properly.
Logs when the server was stopped or restarted:
buildbot@buildbot:/srv/buildbot$ grep -E 'Starting BuildMaster|SIGTERM|shutting' master/twistd.*
master/twistd.log.8:2023-08-22 23:16:31+0000 [-] Not shutting down, there are 1 builds running
master/twistd.log.8:2023-08-22 23:21:33+0000 [-] Starting BuildMaster -- buildbot.version: 3.8.0
master/twistd.log:2023-08-25 09:54:04+0000 [-] Ignoring SIGTERM, master is already shutting down.
master/twistd.log:2023-08-25 09:55:46+0000 [-] Starting BuildMaster -- buildbot.version: 3.8.0
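The same grep can be expressed as a tiny classifier that reports the last lifecycle event seen in the logs. This is a rough sketch that relies only on the two message strings visible in the output above; the function name last_lifecycle_event is hypothetical, and the lines are assumed to be passed in chronological order (oldest rotated log first).

```python
def last_lifecycle_event(lines):
    """Return 'started', 'shutting down', or None if no event was seen."""
    state = None
    for line in lines:
        if "Starting BuildMaster" in line:
            state = "started"
        elif "master is already shutting down" in line:
            # Logged when SIGTERM arrives during an ongoing shutdown.
            state = "shutting down"
    return state
```

If this returns 'shutting down' while the twistd process is still alive, the master is probably in the stuck state described in this issue.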
@erlend-aasland saw this issue on a PR: python/cpython#108392
I also noticed that some buildbots were marked as failing on test_peg_generator even though I fixed it yesterday, and no new job had been scheduled on the related builders.
Examples of test_peg_generator failures:
- https://buildbot.python.org/all/#/builders/729/builds/5302
- https://buildbot.python.org/all/#/builders/58/builds/5489
Since I just restarted the server, new builds should now be scheduled again, so it should be fine.
@zware @pablogsal @ambv: It's unclear to me how this issue occurs :-( It seems like the issue occurs when:
1. I make a config change
2. I apply the config change manually
3. A cron job applies the change again
For me, one known issue is that make stop-master doesn't work :-( It sends a gentle SIGINT to Twisted, which says "sure, I'm going to stop", but then Twisted hangs and never stops. In (3), I almost always end up using kill -9 pid. Honestly, I would suggest doing that in make stop-master as well. The command must do what it says, and not leave the buildbot server in a broken state: running but stuck.
Maybe the bug is that the server sometimes gets stuck in that shutdown state: it keeps running but doesn't schedule new jobs since "it's being shut down".
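The escalation suggested above (ask politely first, then kill -9 if the process is still there) could look roughly like this. It is a sketch under assumptions, not what make stop-master does: the names stop_with_escalation and is_running, the grace period, and the choice of SIGTERM as the first signal (taken from the "Ignoring SIGTERM" log line) are all illustrative.

```python
import os
import signal
import time


def is_running(pid):
    """Return True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stop_with_escalation(pid, grace=60.0, poll=0.5, alive=is_running):
    """Send SIGTERM, wait up to `grace` seconds, then fall back to SIGKILL.

    Returns "not running", "terminated" or "killed".
    """
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return "not running"
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if not alive(pid):
            return "terminated"
        time.sleep(poll)
    # The process ignored the polite request: force it to stop.
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return "terminated"  # it exited just before the kill
    return "killed"
```

The alive parameter exists so that a caller that spawned the process itself can pass its own liveness check; the default probe with signal 0 is meant for a daemonized master that is not a child of the calling script.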
I wrote PR #396 to fix it.