python / buildmaster-config

Configuration for buildbot.python.org

Home Page:https://buildbot.python.org/all/#/

The buildbot server no longer starts new builds

vstinner opened this issue

$ date
Fri Aug 25 09:51:55 UTC 2023

$ grep 'starting build ' master/twistd.log
# empty

$ grep 'starting build ' master/twistd.log.1|tail

2023-08-24 16:38:47+0000 [-] starting build <Build AMD64 Fedora Stable Clang Installed 3.x number:4304 results:success>.. pinging the worker <WorkerForBuilder builder='AMD64 Fedora Stable Clang Installed 3.x' worker='cstratak-fedora-stable-x86_64' state=BUILDING>
2023-08-24 16:38:47+0000 [-] starting build <Build AMD64 Fedora Stable LTO + PGO 3.x number:4551 results:success>.. pinging the worker <WorkerForBuilder builder='AMD64 Fedora Stable LTO + PGO 3.x' worker='cstratak-fedora-stable-x86_64' state=BUILDING>
2023-08-24 16:45:35+0000 [-] starting build <Build s390x Fedora Rawhide 3.x number:None results:success> using worker <WorkerForBuilder builder='s390x Fedora Rawhide 3.x' worker='edelsohn-fedora-rawhide-z' state=AVAILABLE>
2023-08-24 16:45:35+0000 [-] starting build <Build s390x Fedora Rawhide Clang 3.x number:None results:success> using worker <WorkerForBuilder builder='s390x Fedora Rawhide Clang 3.x' worker='edelsohn-fedora-rawhide-z' state=AVAILABLE>
2023-08-24 16:45:36+0000 [-] starting build <Build s390x Fedora Rawhide Clang Installed 3.x number:None results:success> using worker <WorkerForBuilder builder='s390x Fedora Rawhide Clang Installed 3.x' worker='edelsohn-fedora-rawhide-z' state=AVAILABLE>
2023-08-24 16:45:36+0000 [-] starting build <Build s390x Fedora Rawhide 3.x number:3548 results:success>.. pinging the worker <WorkerForBuilder builder='s390x Fedora Rawhide 3.x' worker='edelsohn-fedora-rawhide-z' state=BUILDING>
2023-08-24 16:45:36+0000 [-] starting build <Build s390x Fedora Rawhide LTO + PGO 3.x number:None results:success> using worker <WorkerForBuilder builder='s390x Fedora Rawhide LTO + PGO 3.x' worker='edelsohn-fedora-rawhide-z' state=AVAILABLE>
2023-08-24 16:45:36+0000 [-] starting build <Build s390x Fedora Rawhide Clang 3.x number:3538 results:success>.. pinging the worker <WorkerForBuilder builder='s390x Fedora Rawhide Clang 3.x' worker='edelsohn-fedora-rawhide-z' state=BUILDING>
2023-08-24 16:45:37+0000 [-] starting build <Build s390x Fedora Rawhide Clang Installed 3.x number:3692 results:success>.. pinging the worker <WorkerForBuilder builder='s390x Fedora Rawhide Clang Installed 3.x' worker='edelsohn-fedora-rawhide-z' state=BUILDING>
2023-08-24 16:45:37+0000 [-] starting build <Build s390x Fedora Rawhide LTO + PGO 3.x number:3627 results:success>.. pinging the worker <WorkerForBuilder builder='s390x Fedora Rawhide LTO + PGO 3.x' worker='edelsohn-fedora-rawhide-z' state=BUILDING>
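The two greps above can be folded into a small helper that reports when the master last started a build. This is only a sketch: the function name `last_build_start` is made up for illustration, and it assumes the log line format shown above (timestamp in the first two space-separated fields).

```shell
# Hypothetical helper: print the timestamp of the most recent
# "starting build" entry. Pass the log files newest first; the first
# file that contains a match wins. Returns 1 if no file has a match.
last_build_start() {
    for log in "$@"; do
        [ -r "$log" ] || continue
        # grep exits non-zero when nothing matches; that is not an error here.
        line=$(grep 'starting build ' "$log" | tail -n 1) || true
        if [ -n "$line" ]; then
            # The timestamp is the first two space-separated fields.
            printf '%s\n' "$line" | cut -d ' ' -f 1-2
            return 0
        fi
    done
    return 1
}
```

For example, `last_build_start master/twistd.log master/twistd.log.1` prints the date and time of the last build start, which can then be compared with the output of `date`.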

Yesterday, a config change was made:

commit 6fed7ed896f67054c89428f0ef048f9f3a6b0098 (HEAD -> main, origin/main, origin/HEAD)
Author: Zachary Ware <zach@python.org>
Date:   Thu Aug 24 11:38:26 2023 -0500

    Add issue template for worker addition (#383)

I suppose that the server was restarted to take this new config into account.

I tried to restart the server to repair the issue:

buildbot@buildbot:/srv/buildbot$ make stop-master
(...)
2023-08-25 09:54:04+0000 [-] Ignoring SIGTERM, master is already shutting down.

I had to kill the server :-(

buildbot@buildbot:/srv/buildbot$ ps ax|grep buildbot
1015538 pts/0    S      0:00 sudo -u buildbot -s -H
1016928 pts/0    S+     0:00 grep buildbot
3484334 ?        Sl   894:05 /srv/buildbot/venv/bin/python3.9 -c from twisted.scripts import twistd; twistd.run() --no_save --logfile=twistd.log --python=buildbot.tac
buildbot@buildbot:/srv/buildbot$ kill -9 3484334
buildbot@buildbot:/srv/buildbot$ ps ax|grep buildbot
1015538 pts/0    S      0:00 sudo -u buildbot -s -H
1017004 pts/0    S+     0:00 grep buildbot

I restarted the server:

$ make start-master

I see at least one issue: make stop-master is usually unable to stop the server. It seems that yesterday, deploying the new config triggered make stop-master. The server got SIGINT and stopped starting new builds, but it never actually exited, and so it was not restarted properly.

Logs when the server was stopped or restarted:

buildbot@buildbot:/srv/buildbot$ grep -E 'Starting BuildMaster|SIGTERM|shutting' master/twistd.*
master/twistd.log.8:2023-08-22 23:16:31+0000 [-] Not shutting down, there are 1 builds running
master/twistd.log.8:2023-08-22 23:21:33+0000 [-] Starting BuildMaster -- buildbot.version: 3.8.0

master/twistd.log:2023-08-25 09:54:04+0000 [-] Ignoring SIGTERM, master is already shutting down.
master/twistd.log:2023-08-25 09:55:46+0000 [-] Starting BuildMaster -- buildbot.version: 3.8.0

@erlend-aasland saw this issue on a PR: python/cpython#108392

I also noticed that some buildbots were marked as failing on test_peg_generator, even though I fixed it yesterday, and no new job was scheduled on the related builders.

Examples of test_peg_generator failures:

Since I just restarted the server, new builds should now be scheduled again, so it should be fine.

@zware @pablogsal @ambv: It's unclear to me how this issue occurs :-( It seems like the issue occurs when:

  1. I make a config change
  2. I apply the config change manually
  3. A cron job reapplies the change again

For me, one known issue is that make stop-master doesn't work :-( It sends a gentle SIGINT to Twisted, which says "sure, I'm going to stop", but then Twisted hangs and never stops. In (3), I almost always end up using kill -9 pid. Honestly, I would suggest doing that in make stop-master as well. The command must do what it says, and not leave the buildbot server in a broken state: running but stuck.

Maybe the bug is that sometimes the server gets stuck in that shutdown state: it keeps running but doesn't schedule new jobs since "it's being shut down".
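To illustrate the "ask nicely, then kill -9" idea, here is a minimal sketch. It is not the content of any PR mentioned in this issue: the function name `stop_with_timeout` and the 30 second default grace period are made up, and it assumes the caller already knows the PID of the master (for example from `ps` as shown earlier).

```shell
# Hypothetical helper: ask a process to stop, then force it if it hangs.
# Usage: stop_with_timeout PID [GRACE_SECONDS]
# Returns 0 if the process exited on its own (or was already gone),
# 1 if it had to be killed with SIGKILL.
stop_with_timeout() {
    pid=$1
    timeout=${2:-30}   # assumed default, tune as needed

    # Polite request first, so a healthy master can shut down cleanly.
    kill -TERM "$pid" 2>/dev/null || return 0

    waited=0
    # kill -0 sends no signal; it only checks that the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            # Still alive after the grace period: force it.
            kill -KILL "$pid" 2>/dev/null || true
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

With the PID from the `ps` output above, `stop_with_timeout 3484334 60` would wait up to a minute before falling back to SIGKILL. The non-zero return on a forced kill lets the caller log that the graceful stop failed; a Makefile target that should succeed either way can append `|| true`.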

I wrote PR #396 to fix it.

Aaaand I suck at sysadmin, so I wrote PR #397 to fix my fix. Now we should be good.

I'm closing the issue for now. Please reopen it if you again see the server running but no longer spawning new jobs.