Failing check runs causes erroneous scale down
rickardgranberg opened this issue · comments
Checks
- I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
- I'm not using a custom entrypoint in my runner image
Controller Version
0.27.6
Helm Chart Version
0.23.7
CertManager Version
1,12,2
Deployment Method
Helm
cert-manager installation
Not a problem with cert-manager.
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of contributors and maintainers if your business is so critical and therefore you need priority support
- I've read releasenotes before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
- My actions-runner-controller version (v0.x.y) does support the feature
- I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
- I've migrated to the workflow job webhook event (if you using webhook driven scaling)
Resource Definitions
Not applicable, see #2118
To Reproduce
Not applicable, see #2118
Describe the bug
This bug is a continuation of issue #2118 more specifically the fixes implemented in PR #2520.
The fix tries to exclude check runs from causing scale down, but as can be seen on line 220 of the fix, it only does so if the check is successful:
By the nature of check runs, they are intended to fail if there's a problem. So I question why this filter only applies for the "success" conclusion. As it is now, it causes erroneous scale down whenever a check fails (yes. like in the original issue, we're using EnricoMi/publish-unit-test-result-action)
One observation I have is that these events have the labels
set to []
so perhaps that would be a better way to detect them?
{
"action": "completed",
"workflow_job": {
<redacted>
"status": "completed",
"conclusion": "failure",
"created_at": "2024-06-07T12:13:20Z",
"started_at": "2024-06-07T12:13:20Z",
"completed_at": "2024-06-07T12:13:20Z",
"name": "Test Results",
"steps": [
],
"check_run_url": "https://api.github.com/repos/<redacted>",
"labels": [
],
"runner_id": null,
"runner_name": null,
"runner_group_id": null,
"runner_group_name": null
},
Describe the expected behavior
I expect check run events to not cause scale down regardless of the conclusion.
Whole Controller Logs
Not applicable, see #2118
Whole Runner Pod Logs
Not applicable, see #2118
Additional Context
No response
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
I will provide a PR with a fix if you accept one? #3594