cloudbase / garm

GitHub Actions Runner Manager

Garm fails if one of multiple pools errors.

maigl opened this issue · comments

Issue:

In our current setup we run multiple pools with different credentials. One token expired, and now garm fails to start even though all other pools have valid credentials.

2022/10/18 16:19:39 starting pool manager: initializing tools: fetching runner tools: GET https://my.bad.repo: 403 Must have admin rights to Repository. []

Expected behavior:

garm should log the error and ignore the invalid pool, but continue to manage all other pools.

Michael Kuhnt michael.kuhnt@mercedes-benz.com Mercedes-Benz Tech Innovation GmbH

Yup. This is an annoying one I noticed, but never got around to fixing it. Having garm start and try to authenticate in the background should not be difficult. The more involved part is showing the user what's happening.

We need to add a pool status or a repo/org/enterprise status or a credentials status/validation command. I am not sure where the best place would be to add this. Maybe all of the above, but it will require some store changes, API changes and cli changes.

Will look into this soon, as it's been on my mind for a while. Luckily, this only happens on startup, and can be fixed by simply replacing the token, but it's extremely bad UX.

Thanks for opening this issue.

... but, as those tokens are not necessarily under my control, I might end up not being able to fix this myself. And I think there is also no way to remove the pool or org, apart from hacking the db, right?

Right now I just ignore the erroneous pool and skip it .. and continue with the rest .. maybe the solution could be as simple as that.

TL;DR: You are correct. The initial fix for this will be to skip all pools governed by the invalid token and not start the pool manager for them. We will also add a status message when listing pools, so the user knows the pool manager is not running.

TS;NM:

... but, as those tokens are not necessarily under my control, I might end up not being able to fix this myself.

That is true. And this is indeed a bug that needs to be fixed. I just need to find some time to think about how best to approach this. The quick solution is to simply skip a pool if it errs out, but the user will (currently) not know what happened until they look at the logs. The way it works now is bad user experience, but silently skipping a pool without surfacing the reason for the failure would not help much UX-wise either.
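The skip-and-record approach could look roughly like the sketch below. All type and function names here are hypothetical illustrations, not garm's actual internals; the point is just that a credentials failure for one pool is recorded and logged rather than aborting startup:

```go
package main

import (
	"errors"
	"fmt"
)

// Pool is a stand-in for a garm pool definition.
type Pool struct {
	ID          string
	Credentials string
}

// validateCredentials stands in for the GitHub API call that fetches
// runner tools; it fails when the token lacks access.
func validateCredentials(p Pool) error {
	if p.Credentials == "expired-token" {
		return errors.New("403 Must have admin rights to Repository")
	}
	return nil
}

// startPoolManagers skips pools whose credentials fail validation
// instead of aborting startup, and records the failure reason so it
// can later be surfaced to the user.
func startPoolManagers(pools []Pool) (started []string, failed map[string]string) {
	failed = map[string]string{}
	for _, p := range pools {
		if err := validateCredentials(p); err != nil {
			// Log and record the error, but keep going.
			fmt.Printf("skipping pool %s: %s\n", p.ID, err)
			failed[p.ID] = err.Error()
			continue
		}
		started = append(started, p.ID)
	}
	return started, failed
}

func main() {
	pools := []Pool{
		{ID: "pool-1", Credentials: "good-token"},
		{ID: "pool-2", Credentials: "expired-token"},
		{ID: "pool-3", Credentials: "good-token"},
	}
	started, failed := startPoolManagers(pools)
	fmt.Println("started:", started)
	fmt.Println("failed:", failed)
}
```

The `failed` map is what a later change could persist to the store so that `garm-cli` can display the failure reason instead of leaving it buried in the logs.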

And I think there is also no way to remove the pool or org, apart from hacking the db, right?

You can use:

# disable the pool so no new runners are created
garm-cli pool update <pool ID> --enabled=false
# Remove any existing runners on the pool
garm-cli runner rm -f <runner name>
# remove pool
garm-cli pool rm <pool ID>
# remove org/repo/enterprise
garm-cli repo rm <repo ID>

But if GH creds are bad, this will most likely fail. You are correct.

Right now I just ignore the erroneous pool and skip it .. and continue with the rest .. maybe the solution could be as simple as that.

That is perfectly acceptable to get you unblocked.

Ultimately we will also need to make sure we let the user know that the pool is erroneous and why. When the issue is a github token, that will probably invalidate every pool created for the affected repo/org/enterprise, as the token is tied to these entities and used to spin up pools. So a complete fix for this will probably need to touch all relevant bits:

  • Validate credentials on startup, or do so periodically in the background by issuing a simple API call.
  • Display a status on the repo/org/enterprise indicating that we can manage runners on it (it exists and we have a valid token to manage it). If at some point a repo is removed from GitHub but we still have it defined in garm, we should be able to handle that scenario as well.
    • Set an error state on the repo and set error details so an operator can query it by doing garm-cli repo show <repo ID>
    • If the entity goes away, we might need to disable all pools created for it and clean up the runners from the provider, as they are most likely orphaned now.
  • Have a status and error details field in pools as well to record pool specific errors
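The status and error-details fields from the list above could be sketched as follows. Field and type names are hypothetical, not garm's actual schema; the sketch only illustrates recording a failure reason that `garm-cli org show` could then display:

```go
package main

import "fmt"

// EntityStatus sketches the proposed status fields for a
// repo/org/enterprise (or a pool).
type EntityStatus struct {
	PoolMgrRunning bool
	FailureReason  string
}

// Org is a stand-in for an organization entity in the store.
type Org struct {
	Name   string
	Status EntityStatus
}

// markFailed disables the pool manager for the entity and records why,
// so an operator can later query the reason instead of reading logs.
func (o *Org) markFailed(reason string) {
	o.Status.PoolMgrRunning = false
	o.Status.FailureReason = reason
}

func main() {
	org := Org{Name: "GsAmFiRa", Status: EntityStatus{PoolMgrRunning: true}}
	org.markFailed(`failed to fetch tools from github for GsAmFiRa: "fetching tools: Unauthorized"`)
	fmt.Printf("Pool manager running | %v\n", org.Status.PoolMgrRunning)
	fmt.Printf("Failure reason       | %s\n", org.Status.FailureReason)
}
```

Persisting such a status per entity (and per pool) is what would let the CLI show a `POOL MGR RUNNING` column and a `Failure reason` field.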

I will have some extra time next week to look into this. A fix will span over multiple PRs, and we'll start by skipping the pool and setting a simple failed state with a message in the store, which will be displayed to the user. Then we'll work on robustness and handle more cases (like the ones mentioned above).

What are your thoughts? Does this sound reasonable?

I updated the GHE PR to include a fix for this issue as well.

Garm will no longer fail on startup due to bad GH credentials:

ubuntu@garm:/tmp$ garm-cli repo ls
+--------------------------------------+-----------------+---------+------------------+------------------+
| ID                                   | OWNER           | NAME    | CREDENTIALS NAME | POOL MGR RUNNING |
+--------------------------------------+-----------------+---------+------------------+------------------+
| f0b1c1c8-b605-4560-adb7-79b95e2e470c | gABrIeL-SaMfIra | ScRiPtS | gabriel          | true             |
+--------------------------------------+-----------------+---------+------------------+------------------+
ubuntu@garm:/tmp$ garm-cli org ls
+--------------------------------------+----------+------------------+------------------+
| ID                                   | NAME     | CREDENTIALS NAME | POOL MGR RUNNING |
+--------------------------------------+----------+------------------+------------------+
| 3f7b0a5c-1d7a-4e52-b0c8-06116eca4091 | GsAmFiRa | gabriel_org      | false            |
+--------------------------------------+----------+------------------+------------------+
ubuntu@garm:/tmp$ garm-cli org show 3f7b0a5c-1d7a-4e52-b0c8-06116eca4091
+----------------------+--------------------------------------------------------------------------------+
| FIELD                | VALUE                                                                          |
+----------------------+--------------------------------------------------------------------------------+
| ID                   | 3f7b0a5c-1d7a-4e52-b0c8-06116eca4091                                           |
| Name                 | GsAmFiRa                                                                       |
| Credentials          | gabriel_org                                                                    |
| Pool manager running | false                                                                          |
| Failure reason       | failed to fetch tools from github for GsAmFiRa: "fetching tools: Unauthorized" |
| Pools                | 527906c8-051c-4169-bb04-1490e085e263                                           |
+----------------------+--------------------------------------------------------------------------------+
ubuntu@garm:/tmp$ garm-cli runner rm -f garm-ae9d3c7c-c7df-47fa-a15e-7a948fb10f43
Error: sending request: error in API call: pool manager is not running for GsAmFiRa

More changes to improve robustness will be added when I find more time.

that sounds very good.. thanks.

Give #37 a shot. A fix for this issue is included there. Let me know if it works as expected. You will need to build both garm and garm-cli.