cloudbase / garm

GitHub Actions Runner Manager

garm spawns multiple runners when instance in "pending_create" state (custom kubernetes provider)

rafalgalaw opened this issue

Over the last few days I have implemented a custom Kubernetes provider, which should spawn pods running a GitHub Actions runner. However, after several hours of debugging, one issue in particular stood out:

  1. After spawning a new Pod, my provider reports the result back to garm just fine, with a status of "pending_create":
    {
      "provider_id": "runner-28bfa970-2a7a-4158-8e6a-4848034c543a",
      "agent_id": 0,
      "name": "garm-styaLhsim9Dx",
      "os_type": "linux",
      "os_name": "ubuntu",
      "os_version": "22.04",
      "os_arch": "arm64",
      "status": "pending_create",
      "pool_id": "6c6798ff-552b-4f8e-9629-352c8b3036a9",
      "updated_at": "0001-01-01T00:00:00Z",
      "github-runner-group": ""
    }

  2. Inside the consolidate loop in garm, before attempting to create a new pending instance, the current instance status is set to "creating", which in theory should prevent garm from spawning a new instance again for the next iteration of the consolidate loop:
    addPendingInstances

  3. However, it seems that after the status of the instance is set to "creating", the provider's CreateInstance is run, which returns an updated instance whose status field is "pending_create" again, because the pod is not ready yet and is therefore still in a "pending_create" status. This repeats the whole process every 5 seconds. Am I missing something here? Would appreciate any hint or help :)

Hi @rafalgalaw !

Ohh cool! I never thought someone would want a k8s provider for garm, considering there is the ARC controller 😄. This is really nice to see!

As to your immediate issue, some context first.

The lifecycle state of a runner is as follows:

  • garm determines whether it needs to create a new runner, either based on a webhook it received from github for a queued job, or to maintain the configured minimum number of idle runners
  • If a new runner needs to be added, an entry is set in the DB with the status of pending_create.
  • The consolidate loop looks for instances in pending_create and executes the provider's CreateInstance() function. At this point, garm itself transitions the runner into creating. So far, the provider is not involved in setting the state of the runner.
  • When CreateInstance() returns, it should either return an Instance{} object or an error code if something failed. Optionally you can return some error info that will be displayed in the instance details under provider_fault.
    • Your provider should only report a small set of instance statuses as a result of CreateInstance(), ListInstances(), GetInstance(): in practice running, or error if something went wrong (see the sketch after this list)
    • The other states are set by garm itself
  • When a runner has finished running a job, a github hook is sent to garm, and the instance is set to pending_delete as a result.
  • During the consolidate loop, garm will look for instances to delete, and transition them to deleting.
    • The DeleteInstance function is called. If it succeeds, the instance is deleted from the store. If it fails, garm will set the instance back to pending_delete and retry in the next consolidate loop run.
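
To make that split concrete, here is a rough sketch of the statuses involved. The Go constant names below are illustrative (the authoritative definitions live in garm's params package); the string values are the ones used throughout this thread.

// Illustrative status constants; garm's params package defines the real ones
// and may use different Go identifiers.
package lifecycle

type InstanceStatus string

const (
	// Statuses a provider is expected to report back from CreateInstance(),
	// GetInstance() and ListInstances():
	StatusRunning InstanceStatus = "running"
	StatusError   InstanceStatus = "error"

	// Statuses garm manages itself in the consolidate loop:
	StatusPendingCreate InstanceStatus = "pending_create" // queued for creation
	StatusCreating      InstanceStatus = "creating"       // CreateInstance() in progress
	StatusPendingDelete InstanceStatus = "pending_delete" // queued for deletion
	StatusDeleting      InstanceStatus = "deleting"       // DeleteInstance() in progress
)

// Create path: pending_create -> creating -> running (or error)
// Delete path: pending_delete -> deleting -> removed from the store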

If your provider is written in Go, there are a couple of examples for OpenStack and Azure. You can have a look at those for pointers. To create a new provider, you essentially have to implement the external provider interface.
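
For orientation, the surface you end up implementing looks roughly like the sketch below. The Go signatures and placeholder types here are my shorthand, not garm's actual API, so treat the execution package in the linked repos as the source of truth.

// Rough sketch of the operations an external provider handles. The
// signatures and placeholder types are illustrative, not garm's actual API.
package provider

import "context"

// BootstrapInstance and Instance stand in for the corresponding params types.
type BootstrapInstance struct{ Name string }

type Instance struct {
	ProviderID string
	Name       string
	Status     string // "running" or "error" when reported by the provider
}

type ExternalProvider interface {
	CreateInstance(ctx context.Context, bootstrap BootstrapInstance) (Instance, error)
	GetInstance(ctx context.Context, instanceName string) (Instance, error)
	ListInstances(ctx context.Context, poolID string) ([]Instance, error)
	DeleteInstance(ctx context.Context, instanceName string) error
	RemoveAllInstances(ctx context.Context) error
	Stop(ctx context.Context, instanceName string, force bool) error
	Start(ctx context.Context, instanceName string) error
}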

There is some boilerplate you need to implement, but it's quite small. You can see it here:

https://github.com/cloudbase/garm-provider-openstack/blob/main/main.go#L18-L36

The execution.GetEnvironment() function parses environment variables and stdin. All the info you need will be in there. You can then implement your provider and pass it to execution.Run().
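
Something along these lines, assuming the shape of the linked OpenStack main.go (import paths differ between garm versions, and NewKubernetesProvider is just a stand-in for your own constructor):

// Minimal external provider entry point, modeled on the linked OpenStack
// provider. Import paths, the Environment fields used below and the
// NewKubernetesProvider constructor are assumptions; adapt them to your
// module layout and garm version.
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"os/signal"
	"syscall"

	"github.com/cloudbase/garm/runner/providers/external/execution"

	provider "example.com/garm-provider-k8s/provider"
)

func main() {
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	// Parses the command, environment variables and stdin payload garm sends.
	env, err := execution.GetEnvironment()
	if err != nil {
		log.Fatal(err)
	}

	// Build your provider from the config file and controller ID garm passes in.
	prov, err := provider.NewKubernetesProvider(env.ProviderConfigFile, env.ControllerID)
	if err != nil {
		log.Fatal(err)
	}

	// Dispatches the requested operation (CreateInstance, DeleteInstance, ...)
	// to the provider and prints the JSON garm expects on stdout.
	result, err := execution.Run(ctx, prov, env)
	if err != nil {
		log.Fatalf("failed to run command: %+v", err)
	}
	if len(result) > 0 {
		fmt.Fprint(os.Stdout, result)
	}
}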

If you're writing your provider in bash, there are a couple of samples here: https://github.com/cloudbase/garm/tree/main/contrib/providers.d

Let me know if I can help out. Is your provider public?

I think the issue you're seeing is that your provider returned pending_create instead of running or error, which reset the status back to a state that made the consolidate loop create a new pod.
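
In other words, the end of your CreateInstance() should look something like the sketch below. The types and the createRunnerPod helper are stand-ins, not garm's actual API; the point is only the Status value you hand back.

// Hedged sketch of a CreateInstance that reports "running" once the pod has
// been submitted, instead of echoing "pending_create" back to garm.
package provider

import "context"

// Instance mirrors the JSON the provider prints back to garm (see the example
// earlier in this issue); the Go field names here are illustrative.
type Instance struct {
	ProviderID string `json:"provider_id"`
	Name       string `json:"name"`
	OSType     string `json:"os_type"`
	OSArch     string `json:"os_arch"`
	Status     string `json:"status"`
	PoolID     string `json:"pool_id"`
}

type BootstrapInstance struct {
	Name   string
	OSArch string
	PoolID string
}

type k8sProvider struct{}

// createRunnerPod is a hypothetical helper that submits the runner pod and
// returns its UID.
func (p *k8sProvider) createRunnerPod(ctx context.Context, b BootstrapInstance) (string, error) {
	return "runner-pod-uid", nil
}

func (p *k8sProvider) CreateInstance(ctx context.Context, b BootstrapInstance) (Instance, error) {
	podUID, err := p.createRunnerPod(ctx, b)
	if err != nil {
		// "error" lets garm record the failure instead of retrying creation forever.
		return Instance{Name: b.Name, Status: "error"}, err
	}
	// "running" (not "pending_create"): returning "pending_create" is what made
	// the consolidate loop schedule another pod every 5 seconds.
	return Instance{
		ProviderID: podUID,
		Name:       b.Name,
		OSType:     "linux",
		OSArch:     b.OSArch,
		Status:     "running",
		PoolID:     b.PoolID,
	}, nil
}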

The lifecycle state is pending_create --> creating --> running, and when the runner is removed by github: pending_delete --> deleting. This is for the instance, in this case your pod. The github runner state is different and is usually set by the runner itself while it is installing, via userdata. In your case, this might happen in the entrypoint or in an init container. The usual installation script for runners running on VMs includes functions like:

function sendStatus() {
    MSG="$1"
    # "call" is defined elsewhere in the full template; it sends this JSON
    # status update back to garm's instance callback URL.
    call "{\"status\": \"installing\", \"message\": \"$MSG\"}"
}

function success() {
    MSG="$1"
    ID=$2
    call "{\"status\": \"idle\", \"message\": \"$MSG\", \"agent_id\": $ID}"
}

function fail() {
    MSG="$1"
    call "{\"status\": \"failed\", \"message\": \"$MSG\"}"
    exit 1
}
.

Also, the ID of the runner suggests you may be running an older version of garm. I suggest you update it if possible.

Scratch that. Was looking at the provider ID.

We are currently running ARC and garm side by side, but are evaluating whether it is less maintenance overhead to just use garm for runners in VMs and containers, so this is more of a PoC, written in Go :)
But thank you very much for your detailed response, I simply missed the part about the valid status values after CreateInstance.

My pleasure! I need to allocate some time to write proper docs. The ones that exist now are really sparse. Feel free to ping me if you need any help! 😄