ARC not working with ResourceQuotas. Fails to schedule pod instead of queuing.
ropelli opened this issue · comments
Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.9.3
Deployment Method
Helm
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
You need to request more CPU, memory, storage, etc. than the resource quota allows. For example:
- Define a ResourceQuota with a hard limit of 8 CPUs:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: arc-runners-quota
  namespace: arc-runners
spec:
  hard:
    requests.cpu: "8"
```
- Set up ARC with an AutoscalingRunnerSet whose runner container requests more than half of the quota, e.g. 5 CPUs (so two concurrent runners would need 10 CPUs, exceeding the 8-CPU quota):
```yaml
apiVersion: actions.github.com/v1alpha1
kind: AutoscalingRunnerSet
metadata:
  name: self-hosted
  namespace: arc-runners
spec:
  ...
  template:
    spec:
      containers:
        - name: runner
          resources:
            requests:
              cpu: "5"
  ...
```
- Run a workflow with two jobs matching the AutoscalingRunnerSet name:
```yaml
on:
  push:
    branches: [ main ]

jobs:
  job1:
    runs-on: self-hosted
    steps:
      - run: sleep 60
  job2:
    runs-on: self-hosted
    steps:
      - run: sleep 60
```
You can also trigger this with a single job that exceeds the resource quota, but the two-job scenario above is more likely in practice.
Describe the bug
In the example provided, one job will run while the other gets stuck waiting for a runner, because the EphemeralRunner ends up in the Failed state.
In general, jobs get stuck waiting for a runner that never appears until another job is scheduled for the same runner scale set.
Describe the expected behavior
In the example provided, the jobs should queue properly, run one at a time, and complete one after the other, leading to a successful build.
In general, when the quota is temporarily exceeded, ARC should retry after a while, preferably through a queue implementation.
Additional Context
Previous issues where removing ResourceQuota helped:
https://github.com/actions/actions-runner-controller/issues/3211#issuecomment-1883410610
https://github.com/actions/actions-runner-controller/issues/3191#issuecomment-1883407473
Controller Logs
https://gist.github.com/ropelli/86ac726df685716b2e7e510a72e63139
Runner Pod Logs
No runner pod is created, so there are no logs.
I created a separate issue for k8s mode and container hooks: #3630. There, the jobs fail outright if there is not enough quota available at the moment.
The following change seems to "fix" at least the simple two-job case, but this is probably not the way to go and I would not recommend it:
```diff
diff --git a/controllers/actions.github.com/ephemeralrunner_controller.go b/controllers/actions.github.com/ephemeralrunner_controller.go
index 36ea114..be82878 100644
--- a/controllers/actions.github.com/ephemeralrunner_controller.go
+++ b/controllers/actions.github.com/ephemeralrunner_controller.go
@@ -21,6 +21,7 @@ import (
 	"errors"
 	"fmt"
 	"net/http"
+	"strings"
 	"time"

 	"github.com/actions/actions-runner-controller/apis/actions.github.com/v1alpha1"
@@ -216,6 +217,21 @@ func (r *EphemeralRunnerReconciler) Reconcile(ctx context.Context, req ctrl.Requ
 	case err == nil:
 		return result, nil
 	case kerrors.IsInvalid(err) || kerrors.IsForbidden(err):
+		if strings.Contains(err.Error(), "exceeded quota") {
+			log.Info("Failed to create a pod due to quota exceeded. Let's try again later")
+			log.Error(err, "Error: ")
+			err := r.Patch(ctx, ephemeralRunner, client.RawPatch(types.MergePatchType, []byte(`{"metadata":{"finalizers":[]}}`)))
+			if err != nil {
+				log.Error(err, "Error: ")
+				return ctrl.Result{}, err
+			}
+			err = r.Delete(ctx, ephemeralRunner)
+			if err != nil {
+				log.Error(err, "Error: ")
+				return ctrl.Result{}, err
+			}
+			return ctrl.Result{}, nil
+		}
 		log.Error(err, "Failed to create a pod due to unrecoverable failure")
 		errMessage := fmt.Sprintf("Failed to create the pod: %v", err)
 		if err := r.markAsFailed(ctx, ephemeralRunner, errMessage, ReasonInvalidPodFailure, log); err != nil {
```