cloudbase / garm

GitHub Actions Runner Manager


Failed runners are not removed on LXD

SystemKeeper opened this issue · comments

Sadly I have to report that it's still not working 100% on LXD.

# garm -version
d8ed552

# garm-cli runner list -a
+-------------------+---------+---------------+--------------------------------------+
| NAME              | STATUS  | RUNNER STATUS | POOL ID                              |
+-------------------+---------+---------------+--------------------------------------+
| garm-jWvAZKguDj1R | running | failed        | f52a63ee-7af9-4a19-b7f0-4f1f67de1579 |
+-------------------+---------+---------------+--------------------------------------+
| garm-uzQUmtNIXZyL | running | failed        | f52a63ee-7af9-4a19-b7f0-4f1f67de1579 |
+-------------------+---------+---------------+--------------------------------------+
| garm-H21L1ip0EZvY | running | failed        | f52a63ee-7af9-4a19-b7f0-4f1f67de1579 |
+-------------------+---------+---------------+--------------------------------------+
| garm-GEMOtSLckGH0 | running | failed        | f52a63ee-7af9-4a19-b7f0-4f1f67de1579 |
+-------------------+---------+---------------+--------------------------------------+


# lxc list -c n,s,c,4 | grep garm-jWvAZKguDj1R
| garm-jWvAZKguDj1R | RUNNING | 2023/06/11 15:16 UTC | 172.17.0.1 (docker0) |
# lxc list -c n,s,c,4 | grep garm-uzQUmtNIXZyL
| garm-uzQUmtNIXZyL | RUNNING | 2023/06/11 20:05 UTC | 172.17.0.1 (docker0) |
# lxc list -c n,s,c,4 | grep garm-H21L1ip0EZvY
| garm-H21L1ip0EZvY | RUNNING | 2023/06/12 01:24 UTC | 172.17.0.1 (docker0) |
# lxc list -c n,s,c,4 | grep garm-GEMOtSLckGH0
| garm-GEMOtSLckGH0 | RUNNING | 2023/06/12 02:31 UTC | 172.17.0.1 (docker0) |

So they have been running for a while now, but they never get removed. The garm log has this:

# grep GEMOtSLckGH0 *
garm.log:2023/06/12 02:31:31 creating instance garm-GEMOtSLckGH0 in pool f52a63ee-7af9-4a19-b7f0-4f1f67de1579
garm.log:2023/06/12 02:35:02 instance garm-GEMOtSLckGH0 was updated recently, skipping check
garm.log:2023/06/12 02:35:04 instance garm-GEMOtSLckGH0 is online but github reports runner as offline
garm.log:2023/06/12 02:40:06 instance garm-GEMOtSLckGH0 is online but github reports runner as offline
garm.log:2023/06/12 02:45:13 instance garm-GEMOtSLckGH0 is online but github reports runner as offline
... and so on

Hi @SystemKeeper!

This is actually a different issue. Some context first:

Garm keeps track of two lifecycle states:

  • The lifecycle state of the instance the provider spins up (STATUS)
  • The lifecycle state of the github runner (RUNNER STATUS)

When a provider fails to spin up an instance for whatever reason, the STATUS field transitions to error and the creation is retried 5 times before we give up. After about 20 minutes, the instance is reaped and, if idle runners are configured, a new one is spun up. This repeats indefinitely.

In the above output, it seems the provider succeeded in creating the instance.

Once the instance comes up, the provider is responsible for making sure that the instance has everything it needs to set up the actual runner that joins github. In all cases (so far) this is done via userdata. The provider composes some userdata for an instance, injects it into the cloud, and the instance pulls that down and runs it. This is the process that makes sure the runner is installed and started.

Our userdata currently tries to download the runner, un-archive it, configure it, install the service and start the service. If any of those steps fail, the userdata sends a POST back to garm with the failure.
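Those steps can be sketched roughly as follows. The real template lives in garm's cloudconfig package (linked further down); the variable names and the fail() helper here are illustrative stand-ins, not garm's actual API:

```shell
#!/bin/bash
set -e

# Illustrative failure callback: POST the error back to garm so it shows
# up under the runner's "Status Updates". URL/token vars are stand-ins.
fail() {
    curl -s -X POST -H "Authorization: Bearer ${CALLBACK_TOKEN}" \
        --data "{\"status\": \"failed\", \"message\": \"$1\"}" \
        "${CALLBACK_URL}" || true
    exit 1
}

curl -sfL -o actions-runner.tgz "${DOWNLOAD_URL}" || fail "failed to download runner"
tar xzf actions-runner.tgz                        || fail "failed to un-archive runner"
./config.sh --unattended --url "${REPO_URL}" \
    --token "${REGISTRATION_TOKEN}"               || fail "failed to configure runner"
sudo ./svc.sh install                             || fail "failed to install service"
sudo ./svc.sh start                               || fail "failed to start service"
```

Whichever step errors out first is the reason recorded against the runner.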

If you do a:

garm-cli runner show garm-jWvAZKguDj1R

you should see the failure reason.

If the runner appears on github as offline and we see it in the provider, we reach a condition where garm doesn't know how to recover:

https://github.com/cloudbase/garm/blob/main/runner/pool/pool.go#L496-L501

If you reached this point, it is likely that any retry will result in the same error, so we leave the instance alone to give the operator a chance to do a postmortem.

The fact that it shows up on github means it failed somewhere in here:

https://github.com/cloudbase/garm/blob/main/cloudconfig/templates.go#L129-L147

This is not a bug in garm itself. It may be a bug in the userdata that tries to set up the runner, which still needs to be fixed, but not in the pool manager. It may also be something inside the image you're using that causes the userdata to err out.

We'll know more once you run:

garm-cli runner show garm-jWvAZKguDj1R

and see what the failure reason is.

As a side note, runners that never register on github are automatically reaped after around 20 minutes (configurable per pool): https://github.com/cloudbase/garm/blob/main/runner/pool/pool.go#L181-L202

Also, you should be able to manually remove these runners using:

garm-cli runner rm -f garm-jWvAZKguDj1R

I really need to document all this better. It's sad that there is more useful info in the issues than in the README. Will allocate some time this week for docs.

If you use the default upstream ubuntu:22.04 image, does it work?

Hey @gabriel-samfira

perfect explanation, thanks a lot!
Here is the requested output:

# garm-cli runner show garm-jWvAZKguDj1R
+-----------------+--------------------------------------------------------------+
| FIELD           | VALUE                                                        |
+-----------------+--------------------------------------------------------------+
| ID              | 21fde44c-1836-4129-9037-d662aa071948                         |
| Provider ID     | garm-jWvAZKguDj1R                                            |
| Name            | garm-jWvAZKguDj1R                                            |
| OS Type         | linux                                                        |
| OS Architecture | amd64                                                        |
| OS Name         | ubuntu                                                       |
| OS Version      | jammy                                                        |
| Status          | running                                                      |
| Runner Status   | failed                                                       |
| Pool ID         | f52a63ee-7af9-4a19-b7f0-4f1f67de1579                         |
| Addresses       | 172.17.0.1                                                   |
|                 | 10.19.136.32                                                 |
| Status Updates  | 2023-06-11T15:16:31: runner registration token was retrieved |
|                 | 2023-06-11T15:16:33: configuring runner                      |
|                 | 2023-06-11T15:18:21: failed to configure runner              |
+-----------------+--------------------------------------------------------------+

Output from syslog

Jun 11 15:16:33 garm-jWvAZKguDj1R cloud-init[753]: # Authentication
Jun 11 15:16:34 garm-jWvAZKguDj1R cloud-init[753]: Using V2 flow: False
Jun 11 15:16:40 garm-jWvAZKguDj1R cloud-init[753]: √ Connected to GitHub
Jun 11 15:16:40 garm-jWvAZKguDj1R cloud-init[753]: # Runner Registration
Jun 11 15:16:41 garm-jWvAZKguDj1R cloud-init[753]: √ Runner successfully added
Jun 11 15:16:55 garm-jWvAZKguDj1R systemd[1]: systemd-hostnamed.service: Deactivated successfully.
Jun 11 15:16:58 garm-jWvAZKguDj1R systemd[1]: systemd-timedated.service: Deactivated successfully.
Jun 11 15:17:01 garm-jWvAZKguDj1R CRON[836]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 11 15:18:21 garm-jWvAZKguDj1R cloud-init[753]: The HTTP request timed out after 00:01:40.
Jun 11 15:18:21 garm-jWvAZKguDj1R cloud-init: #############################################################
Jun 11 15:18:21 garm-jWvAZKguDj1R cloud-init: -----BEGIN SSH HOST KEY FINGERPRINTS-----
..
Jun 11 15:18:21 garm-jWvAZKguDj1R cloud-init: -----END SSH HOST KEY FINGERPRINTS-----
Jun 11 15:18:21 garm-jWvAZKguDj1R cloud-init: #############################################################
Jun 11 15:18:21 garm-jWvAZKguDj1R cloud-init[753]: Cloud-init v. 23.1.2-0ubuntu0~22.04.1 finished at Sun, 11 Jun 2023 15:18:21 +0000. Datasource DataSourceLXD.  Up 119.16 seconds
Jun 11 15:18:21 garm-jWvAZKguDj1R systemd[1]: Finished Execute cloud user/final scripts.
Jun 11 15:18:21 garm-jWvAZKguDj1R systemd[1]: Reached target Cloud-init target.
Jun 11 15:18:21 garm-jWvAZKguDj1R systemd[1]: Startup finished in 1min 58.840s.

Guess the http timeout is the problem here...

If you use the default upstream ubuntu:22.04 image, does it work?

Can't really tell, all our workflows are based around the github runner images, so hard to test this under real world conditions 😕

Maybe in general it would make sense to have a "Max runner runtime" property, similar to the "Runner Bootstrap Timeout"? So that after, for example, X minutes/hours the runner is stopped and removed even if it's considered running?! GitHub only terminates self-hosted runners after 35 days...
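Until something like that exists, an operator-side stop-gap could combine the two CLI commands already shown in this thread. A rough sketch (parsing table output is brittle, so treat this as purely illustrative):

```shell
#!/bin/bash

# Extract runner names whose RUNNER STATUS column reads "failed" from
# the ASCII table printed by `garm-cli runner list -a`.
list_failed_runners() {
  awk -F'|' '$4 ~ /failed/ { gsub(/ /, "", $2); print $2 }'
}

# Against a live garm this would be:
#   garm-cli runner list -a | list_failed_runners | xargs -r -n1 garm-cli runner rm -f
# Demo against a captured table:
list_failed_runners <<'EOF'
| NAME              | STATUS  | RUNNER STATUS | POOL ID                              |
| garm-jWvAZKguDj1R | running | failed        | f52a63ee-7af9-4a19-b7f0-4f1f67de1579 |
EOF
```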

Interesting. It fails when running config.sh. It registers on GitHub and then times out. Does this happen every time or is it a transient error which only happens once in a while?

Maybe in general it would make sense to have a "Max runner runtime" property, similar to the "Runner Bootstrap Timeout" ?

I think that would be best. We can reap them as if they had their STATUS set to error.

In the meantime, I think it would be useful for you to enable metrics in garm and hook it up to prometheus (if you have it installed in your env). You can get an overview of how many runners you spin up, if runners are in error/failed state, etc.

https://github.com/cloudbase/garm/blob/main/testdata/config.toml#L30-L39
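For reference, the metrics section of the config looks roughly like the snippet below. The linked testdata is authoritative; the exact field names may differ between versions, so treat this as an assumption:

```toml
[metrics]
# Expose the /metrics endpoint for prometheus to scrape.
enable = true
# If set to true, the endpoint can be scraped without a bearer token.
disable_auth = false
```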

If you enable auth for metrics, you'll need to generate a token. That token is only valid for metrics:

garm-cli metrics-token create

You can then use curl or anything else that can make a request:

curl -X GET -H "Authorization: Bearer {TOKEN}" https://garm.example.com/metrics/

Interesting. It fails when running config.sh. It registers on GitHub and then times out. Does this happen every time or is it a transient error which only happens once in a while?

Looks like only once in a while. Currently 4 of the 20 runners are in that failed state; all 4 failed at "failed to configure runner", so it might be a temporary network issue. Garm was built on June 6th and I don't think there have been any manual restarts in the containers since then. So it's not a widespread issue.
Maybe even re-trying config.sh in the init script would help here already.

Thanks for the idea about the metrics, I'll see if I can enable that!

Looks like only once in a while.

Perfect.

Maybe even re-trying config.sh in the init script would help here already.

Definitely. An ExecWithRetry() function would be useful. Will add that too where it makes sense.
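A minimal sketch of such a retry wrapper, as it might look inside the install script (the function name and retry count are illustrative, not garm's actual implementation):

```shell
#!/bin/bash

# Retry a command up to N times, sleeping between attempts. Returns 0 as
# soon as the command succeeds, 1 if all attempts fail.
exec_with_retry() {
  local retries=$1
  shift
  local attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$retries" ]; then
      echo "command failed after ${retries} attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep 1
  done
}

# Demo: a step that fails twice before succeeding, standing in for a
# transiently failing ./config.sh invocation.
flaky_step() {
  local n
  n=$(( $(cat /tmp/flaky_count 2>/dev/null || echo 0) + 1 ))
  echo "$n" > /tmp/flaky_count
  [ "$n" -ge 3 ]
}

rm -f /tmp/flaky_count
exec_with_retry 5 flaky_step && echo "configured"
```

Wrapping only the network-dependent steps (download, config.sh) this way would paper over exactly the kind of transient HTTP timeout seen in the syslog above.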

I don't think there have been any manual restarts in the containers since then. So it's not a widespread issue.

This sort of failure would not happen if you restart garm, so that's fine.

Will push an update tomorrow.

Thanks for the idea about the metrics, I'll see if I can enable that!

Let me know how that goes and if you feel there are any metrics that you think would make sense to add 😄.

Can you give #106 a try? Hopefully it fixes this issue.