cloudbase / garm

GitHub Actions Runner Manager

Add force option on commands

Hdom opened this issue · comments

Over the past month I have been working on setting up a GARM instance and have run into plenty of situations where a --force option would have come in very handy.

Examples:

  1. (This happened many times across 3 providers: k8s, gcp, openstack.) The provider is unable to connect to the infrastructure. Because of this, the creation of a runner fails, but the delete method also tries to reach the infrastructure and fails too. Now I have a runner stuck in "pending_deleting" or "error", and the retry loop will keep attempting to delete it until the connection is restored. I would like to be able to "force" deletion of the entry from the DB even if it wasn't successfully deleted from the infrastructure. In this case, since there is a network problem, there was nothing to delete, but even if something had been created I can go to the infrastructure and do any manual deletion if needed.
  2. Perhaps this is a different ask, but: the Org oauth2 token lost permission to the org, and I need to delete the associated pools. To delete the pools I first need to delete the runners (Error: [DELETE /pools/{poolID}][400] DeletePool default {Error:Bad Request Details:pool has runners}), but when I try to delete the runners I get this error: Error: [DELETE /instances/{instanceName}][409] DeleteInstance default {Error:Conflict Details:pool manager is not running for Org, Org manager is not running because of error failed to update tools for repo Org: fetching runner tools: GET https://api.github.com/orgs/Org/actions/runners/downloads: 403 Must have admin rights to Repository. So now I have a runner and a pool that are stuck, with no way to force deletion of either.

Hi @Hdom !

Regarding 1), there is a --force-remove-runner|-f flag available on the runner remove command. In version 0.1.4, that flag will transition the runner to pending_force_delete and will ignore any provider error when removing the runner.

Regarding 2), the limitation was intentional, to avoid situations in which a PAT may expire or a runner may be executing a job, and someone tries to remove it. In such cases we don't really want to remove the runner, as that may cancel a potentially long-running job when the underlying runner gets removed mid-execution.

But you are right, this does have the potential to put us in a situation where it's difficult to recover. I think we can have a procedure to purge a pool that has no chance of ever coming back up because it has lost access to the entity it was managing.

I will look into this.

For situations where you can get a new PAT, the quick fix is to replace the old one in the config. The pool should recover.

Edit: Here are the release notes for v0.1.4 that detail the change when using garm-cli runner rm -f: https://github.com/cloudbase/garm/releases/tag/v0.1.4

This should fix 1). For 2) we need to find a proper solution.

Oh man, I don't know how I missed that force flag. I swear I had looked at the help menu for the runner delete before and even tried adding a -f. Maybe it was because I was using the 0.1.4-rc version of the CLI.

I think I added this after RC1 (not sure). Also, sometimes I forget to properly change the help messages 😅 . But in 0.1.4 it should work. At least for ignoring provider errors. The PAT not having access is a different code path. Will need to see what the best way forward is for that one.

Added a PR to address this here:

That change adds a -b | --bypass-github-unauthorized flag to the CLI and a query arg to the API. This flag will work even if the pool manager is stopped due to an Unauthorized error.

If nothing else works and you want to remove the runner from the DB, you can specify both the --force-remove-runner and --bypass-github-unauthorized options. This will bypass the stopped pool and ignore the provider error.
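
For anyone scripting this cleanup, here is a minimal sketch of how the two flags could be combined in a call to garm-cli. It only uses the subcommand and flag names mentioned in this thread (`runner rm`, `--force-remove-runner`, `--bypass-github-unauthorized`). The helper names, the placement of the runner name after the flags, and the example runner name are assumptions for illustration, not something taken from the GARM docs. By default it just prints the command instead of running it.

```python
import shlex
import subprocess


def runner_rm_argv(runner_name, force=False, bypass_unauthorized=False):
    """Build the argv for `garm-cli runner rm` (hypothetical helper).

    Only flags named in this thread are used. Placing the runner name
    last is an assumption about how the CLI parses its arguments.
    """
    argv = ["garm-cli", "runner", "rm"]
    if force:
        # Ignore provider errors while removing the runner (v0.1.4+).
        argv.append("--force-remove-runner")
    if bypass_unauthorized:
        # Proceed even if the pool manager is stopped due to an Unauthorized error.
        argv.append("--bypass-github-unauthorized")
    argv.append(runner_name)
    return argv


def remove_runner(runner_name, force=False, bypass_unauthorized=False, dry_run=True):
    """Print the command (dry run) or execute it with subprocess."""
    argv = runner_rm_argv(runner_name, force, bypass_unauthorized)
    if dry_run:
        print(shlex.join(argv))
    else:
        # Raises CalledProcessError if garm-cli exits with a non-zero status.
        subprocess.run(argv, check=True)
    return argv


if __name__ == "__main__":
    # "garm-example-runner" is a placeholder runner name.
    remove_runner("garm-example-runner", force=True, bypass_unauthorized=True)
```

Keeping `dry_run=True` as the default means nothing is removed until you have checked the printed command against `garm-cli runner rm --help` for your version.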

Give it a shot and let me know.