Mac OS X jobs frequently fail for no apparent reason with missing logs and without ever running `if: always()` steps
JasonGross opened this issue · comments
Describe the bug
Frequently my Mac OS X jobs fail. Usually, when a job fails, the subsequent steps display as canceled and take 0s, and the `if: always()` steps run anyway, as in:
(From https://github.com/mit-plv/fiat-crypto/runs/593609449?check_suite_focus=true )
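A minimal sketch of the kind of workflow being described may help here. The step names, commands, and artifact path are hypothetical and are not taken from the fiat-crypto workflow; the only point is how `if: always()` is expected to behave after a failed step.

```yaml
name: CI (illustrative sketch)
on: [push, pull_request]

jobs:
  build:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v2
      # Hypothetical build step; if it fails, the job is marked as failed.
      - name: Build
        run: make
      # No explicit condition, so this step is skipped once an earlier step has failed.
      - name: Test
        run: make test
      # always() is expected to make this step run even after an earlier failure.
      - name: Upload logs
        if: always()
        uses: actions/upload-artifact@v2
        with:
          name: build-logs    # hypothetical artifact name
          path: build.log     # hypothetical log file
```

In the normal failure case, the last step still runs; in the broken case reported here, it is marked as failed with 0s like every other step.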
However, the Mac job failures, such as https://github.com/mit-plv/fiat-crypto/pull/753/checks?check_run_id=593732097 , display broken logs, where the failing step has no contents (no down arrow), and all subsequent steps fail with 0s, including the `if: always()` steps:
Furthermore, if I click the three dots and click "View raw logs", the logs are missing; I get directed to a page like https://github.com/mit-plv/fiat-crypto/commit/c31a955db7a356f1788d979ac9b1ed1a4fc67674/checks/593732097/logs which says only
2020-04-16T22:40:17.1881748Z ##[section]Starting: Request a runner to run this job
2020-04-16T22:40:17.9419267Z Requesting a hosted runner in current repository's account/organization with labels: 'macos-latest', require runner match: True
2020-04-16T22:40:18.0349714Z Labels matched hosted runners has been found, waiting for one of them get assigned for this job.
2020-04-16T22:40:18.0610578Z ##[section]Finishing: Request a runner to run this job
Area for Triage: Apple
Question, Bug, or Feature?: Bug
Virtual environments affected
- macOS 10.15
- Ubuntu 16.04 LTS
- Ubuntu 18.04 LTS
- Windows Server 2016 R2
- Windows Server 2019
Expected behavior
I should get sensible logs, or, better, the jobs should not be failing at all (they work fine if I restart the job enough times, and they work fine consistently on Linux and often on Windows)
Actual behavior
See above. Link: https://github.com/mit-plv/fiat-crypto/pull/753/checks?check_run_id=593732097
@JasonGross , thank you for reporting this issue!
Unfortunately, this repository manages only image content, but we will try to escalate this issue to the appropriate team.
@TingluoHuang , @ericsciple , @alepauly , is it something that actions/runner does?
@maxim-lobanov can you check the runner diagnostics for this job to see why the runner can't upload logs? I don't know how to access the runner log for the hosted Mac pool.
I'm now seeing this happen on our Linux jobs too, such as https://github.com/mit-plv/fiat-crypto/pull/766/checks?check_run_id=606437846
It seems to be happening on the artifact upload step on Linux; maybe the machines are running out of space or something?
Hello, just a quick update: the issue may come from a bug on our backend. We are still looking at it.
@JasonGross , Hello!
Could you please check if you still see the same issues?
Closing this for now, but please let us know if you still see the same issue.
It's been happening less often, but it just happened again: https://github.com/JasonGross/fiat-crypto/runs/697783422
I've also seen the Mac OS jobs frequently show up as "cancelled" when I didn't cancel them, and I don't believe anyone else did, either.
So I guess this issue should be re-opened
We've had the same issue on one of our builds scheduled to run Monday through Friday at 01:30 UTC. It would be cancelled randomly after running for about 10-15 minutes.
Last night it succeeded for the first time in weeks but I will continue to monitor.
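For readers comparing schedules, the trigger described above (Monday through Friday at 01:30 UTC) would look roughly like the sketch below. This is an assumption about how such a build is configured, not the commenter's actual workflow file.

```yaml
on:
  schedule:
    # Minute 30, hour 1, any day of month, any month, Monday through Friday.
    # GitHub Actions evaluates cron schedules in UTC.
    - cron: '30 1 * * 1-5'
```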
I had this problem, so I made an example project to show GitHub support https://github.com/joehinkle11/Mac-GitHub-Actions-Test/actions
They also responded with an email saying
Hi Joe,
Thank you for your continued patience while we investigated these issues. For context, there is an existing issue tracking this:
Due to similar reliability reports and errors when using our current MacOS platform for GitHub Actions, we have decided to make larger changes that will provide a long-term solution.
We understand that you may continue to experience reliability issues while on the current platform, and hope to provide a better experience as soon as possible. If you notice any issues with billing on the next billing cycle, please reach out.
At this time we have improvements planned for early July and will keep our customers up to date through our blogs and changelog
Please let us know if you have any questions or concerns!
Cheers,
GitHub Support
Hope this helps anyone who is working on an Action and doesn't yet realize it's a bug with GitHub and not their scripts.
I've also had frequent random cancellations of GH Actions jobs (especially Mac OS), with missing logs, such as https://github.com/mit-plv/fiat-crypto/runs/791672004?check_suite_focus=true
And here's one where the logs are present https://github.com/mit-plv/fiat-crypto/runs/791678094?check_suite_focus=true :
GitHub won't even tell me who canceled these jobs, or why they were canceled. (Was it because I pushed another commit that triggered the workflow? Is GitHub now forcibly canceling jobs on old commits, even those which are on the tip of their branch but are not the newest one running across all branches?)
I can also confirm that jobs on MacOS are cancelled for no apparent reason: https://github.com/cytopia/pwncat/pull/80/checks?check_run_id=792119613
Additionally, there are no logs or other information explaining why it was cancelled.
We fixed a configuration issue in the service that caused Mac hosted builds to hit this error every day between 1:00 and 2:00 AM UTC.
That sounds promising 🎉 Is that fix already live?
@svenmuennich the fix is already live, and I can confirm from the telemetry that the fix works as expected.
Great! Thank you 🥇
We will keep this issue open for a few more days. @svenmuennich , @cytopia , @JasonGross , could you please report back if you still see the same issues?
We still see the same issues. Here is a build from 8 hours ago (Wed, 24 Jun 2020 08:08:20 GMT) that failed in this way: https://github.com/mit-plv/fiat-crypto/pull/817/checks?check_run_id=802532325
Attempting to fetch the raw logs gives
2020-06-24T08:08:06.3800369Z ##[section]Starting: Request a runner to run this job
2020-06-24T08:08:06.6634916Z Can't find any online and idle self-hosted runner in current repository that matches the required labels: 'macos-latest'
2020-06-24T08:08:06.6634949Z Can't find any online and idle self-hosted runner in current repository's account/organization that matches the required labels: 'macos-latest'
2020-06-24T08:08:06.6634965Z Found online and idle hosted runner in current repository's account/organization that matches the required labels: 'macos-latest'
2020-06-24T08:08:06.8916332Z ##[section]Finishing: Request a runner to run this job
which is bizarre.
@TingluoHuang , can it be something different?
@maxim-lobanov yes, https://github.com/mit-plv/fiat-crypto/pull/817/checks?check_run_id=802532325 failed due to https://github.com/github/c2c-actions-compute/issues/643 which is not the one I fixed.
https://github.com/github/c2c-actions-compute/issues/643 is a 404 for me; is there any issue I can track about this (other than this present one)?
Last night our scheduled build failed again. This time we got an error though:
An error occurred while provisioning resources (Error Type: Disconnect).
No idea whether that is related to this issue.
Hello everyone!
We have recently made some changes on our side. Could you please check if you still see the same issue (steps without logs)?
I have the same problem on macOS https://github.com/atlas-engine/AtlasStudio/runs/1215960239.
@maxim-lobanov: We also observe this behaviour from time to time. Example: https://github.com/alpaka-group/alpaka/runs/2708464529?check_suite_focus=true
@maxim-lobanov Would you reopen this bug? https://github.com/mit-plv/fiat-crypto/runs/2979458972 has the log-less red ❌'s with missing raw logs
2021-07-03T15:24:52.2128487Z Can't find any online and idle self-hosted or hosted runner in the current repository, account/organization that matches the required labels: 'macos-latest'
2021-07-03T15:24:52.2128608Z Found online and busy hosted runner(s) in the current repository's organization account that matches the required labels: 'macos-latest'. Hit concurrency limits on the hosted runners. Waiting for one of them to get assigned for this job.
2021-07-03T15:24:52.2128637Z Waiting for a hosted runner in 'organization' to pick this job...
"Download log archive" produces an archive that simply does not contain logs for any of the red ❌'s, and that has the same incomplete raw logs.
Hi @JasonGross! Sorry to hear that.
I've checked the telemetry, and the root cause seems to be the same as in #3517
We will notify the engineering team about these new cases.
We're also getting this. Just reporting it in case more examples help isolate the root cause. Example run at https://github.com/rubygems/rubygems/runs/2987926029.
Same issue. The "Run tests" part of macos-latest jobs randomly hangs or takes forever in https://github.com/combinators/cls-scala
Decided to factor them out into a separate workflow so I can restart them more easily until this gets resolved.
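A rough sketch of that kind of workaround follows, for anyone wanting to do the same. It assumes a hypothetical file name, an sbt-based test command, and an arbitrary 30-minute limit; none of these come from the cls-scala repository, and the runner may need a Java/sbt setup step that is omitted here.

```yaml
# .github/workflows/macos-tests.yml (hypothetical file name)
name: macOS tests
on:
  push:
  pull_request:
  # Allows starting this workflow by hand from the Actions tab.
  workflow_dispatch:

jobs:
  test:
    runs-on: macos-latest
    # Assumed limit: GitHub cancels the job after 30 minutes instead of
    # letting a hung run continue until the default 360-minute timeout.
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v2
      - name: Run tests
        # Hypothetical test command for an sbt-based Scala project.
        run: sbt test
```

Keeping the macOS job in its own workflow means a flaky run can be restarted without re-running the Linux and Windows jobs.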
Any update on this?