actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Failing check runs causes erroneous scale down

rickardgranberg opened this issue · comments

Checks

Controller Version

0.27.6

Helm Chart Version

0.23.7

CertManager Version

1,12,2

Deployment Method

Helm

cert-manager installation

Not a problem with cert-manager.

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of contributors and maintainers if your business is so critical and therefore you need priority support
  • I've read releasenotes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • I've migrated to the workflow job webhook event (if you using webhook driven scaling)

Resource Definitions

Not applicable, see #2118

To Reproduce

Not applicable, see #2118

Describe the bug

This bug is a continuation of issue #2118 more specifically the fixes implemented in PR #2520.
The fix tries to exclude check runs from causing scale down, but as can be seen on line 220 of the fix, it only does so if the check is successful:

if e.GetWorkflowJob().GetConclusion() == "success" && e.GetWorkflowJob().RunnerID == nil {

By the nature of check runs, they are intended to fail if there's a problem. So I question why this filter only applies for the "success" conclusion. As it is now, it causes erroneous scale down whenever a check fails (yes. like in the original issue, we're using EnricoMi/publish-unit-test-result-action)

One observation I have is that these events have the labels set to [] so perhaps that would be a better way to detect them?

{
  "action": "completed",
  "workflow_job": {
    <redacted>
    "status": "completed",
    "conclusion": "failure",
    "created_at": "2024-06-07T12:13:20Z",
    "started_at": "2024-06-07T12:13:20Z",
    "completed_at": "2024-06-07T12:13:20Z",
    "name": "Test Results",
    "steps": [

    ],
    "check_run_url": "https://api.github.com/repos/<redacted>",
    "labels": [

    ],
    "runner_id": null,
    "runner_name": null,
    "runner_group_id": null,
    "runner_group_name": null
  },

Describe the expected behavior

I expect check run events to not cause scale down regardless of the conclusion.

Whole Controller Logs

Not applicable, see #2118

Whole Runner Pod Logs

Not applicable, see #2118

Additional Context

No response

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

I will provide a PR with a fix if you accept one? #3594