actions / actions-runner-controller

Checks

I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
I'm not using a custom entrypoint in my runner image

Controller Version

0.27.6

Helm Chart Version

0.23.7

CertManager Version

1,12,2

Deployment Method

Helm

cert-manager installation

Not a problem with cert-manager.

Checks

This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of contributors and maintainers if your business is so critical and therefore you need priority support
I've read releasenotes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
My actions-runner-controller version (v0.x.y) does support the feature
I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
I've migrated to the workflow job webhook event (if you using webhook driven scaling)

Resource Definitions

Not applicable, see #2118

To Reproduce

Not applicable, see #2118

Describe the bug

This bug is a continuation of issue #2118 more specifically the fixes implemented in PR #2520.
The fix tries to exclude check runs from causing scale down, but as can be seen on line 220 of the fix, it only does so if the check is successful:

actions-runner-controller/controllers/actions.summerwind.net/horizontal_runner_autoscaler_webhook.go

Line 220 in 8947327

    
           if e.GetWorkflowJob().GetConclusion() == "success" && e.GetWorkflowJob().RunnerID == nil {

By the nature of check runs, they are intended to fail if there's a problem. So I question why this filter only applies for the "success" conclusion. As it is now, it causes erroneous scale down whenever a check fails (yes. like in the original issue, we're using EnricoMi/publish-unit-test-result-action)

One observation I have is that these events have the labels set to [] so perhaps that would be a better way to detect them?

{
  "action": "completed",
  "workflow_job": {
    <redacted>
    "status": "completed",
    "conclusion": "failure",
    "created_at": "2024-06-07T12:13:20Z",
    "started_at": "2024-06-07T12:13:20Z",
    "completed_at": "2024-06-07T12:13:20Z",
    "name": "Test Results",
    "steps": [

    ],
    "check_run_url": "https://api.github.com/repos/<redacted>",
    "labels": [

    ],
    "runner_id": null,
    "runner_name": null,
    "runner_group_id": null,
    "runner_group_name": null
  },

Describe the expected behavior

I expect check run events to not cause scale down regardless of the conclusion.

Whole Controller Logs

Not applicable, see #2118

Whole Runner Pod Logs

Not applicable, see #2118

Additional Context

No response

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

I will provide a PR with a fix if you accept one? #3594

Failing check runs causes erroneous scale down

Checks

Controller Version

Helm Chart Version

CertManager Version

Deployment Method

cert-manager installation

Checks

Resource Definitions

To Reproduce

Describe the bug

Describe the expected behavior

Whole Controller Logs

Whole Runner Pod Logs

Additional Context