actions/runner-images

GitHub Actions runner images

Mac OS X jobs frequently fail for no apparent reason with missing logs and without ever running `if: always()` steps

JasonGross opened this issue · comments

Describe the bug
My Mac OS X jobs frequently fail. Usually, when a job fails, the subsequent steps display as canceled and take 0s, and the `if: always()` steps run anyway, as in:
[screenshot]
(From https://github.com/mit-plv/fiat-crypto/runs/593609449?check_suite_focus=true )
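For readers unfamiliar with the construct, an `if: always()` step is meant to run even after an earlier step fails. A minimal sketch (the job, step names, and upload script are hypothetical):

```yaml
jobs:
  build:
    runs-on: macos-latest
    steps:
      - name: Build
        run: make               # suppose this step fails
      - name: Upload timing logs
        if: always()            # should still run after the failure above
        run: ./etc/upload-timing-logs.sh
```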
However, the Mac job failures, such as https://github.com/mit-plv/fiat-crypto/pull/753/checks?check_run_id=593732097 , display broken logs, where the failing step has no contents (no down arrow), and all subsequent steps fail with 0s, including the `if: always()` steps:
[screenshot]
Furthermore, if I click the three dots and click "View raw logs", the logs are missing; I get directed to a page like https://github.com/mit-plv/fiat-crypto/commit/c31a955db7a356f1788d979ac9b1ed1a4fc67674/checks/593732097/logs which says only

2020-04-16T22:40:17.1881748Z ##[section]Starting: Request a runner to run this job
2020-04-16T22:40:17.9419267Z Requesting a hosted runner in current repository's account/organization with labels: 'macos-latest', require runner match: True
2020-04-16T22:40:18.0349714Z Labels matched hosted runners has been found, waiting for one of them get assigned for this job.
2020-04-16T22:40:18.0610578Z ##[section]Finishing: Request a runner to run this job

Area for Triage: Apple

Question, Bug, or Feature?: Bug

Virtual environments affected

  • macOS 10.15
  • Ubuntu 16.04 LTS
  • Ubuntu 18.04 LTS
  • Windows Server 2016 R2
  • Windows Server 2019

Expected behavior
I should get sensible logs or, better, the jobs should not fail at all (they work fine if I restart them enough times, and they work consistently on Linux and often on Windows).

Actual behavior
See above. Link: https://github.com/mit-plv/fiat-crypto/pull/753/checks?check_run_id=593732097

@JasonGross, thank you for reporting this issue!
Unfortunately, this repository manages only image content, but we will try to escalate this issue to the appropriate team.

@TingluoHuang, @ericsciple, @alepauly, is this something that actions/runner does?

@maxim-lobanov, can you check the runner diagnostics for this job to see why the runner can't upload logs? I don't know how to access the runner log for the hosted Mac pool.

I'm now seeing this happen on our Linux jobs too, such as https://github.com/mit-plv/fiat-crypto/pull/766/checks?check_run_id=606437846
[screenshot]
It seems to be happening on the artifact upload step on Linux; maybe the machines are running out of space?
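If disk space is the suspect, one cheap diagnostic is a step that prints free space even when earlier steps fail; a sketch (the step name is made up):

```yaml
      - name: Report free disk space
        if: always()
        run: df -h
```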

Hello, just a quick update: the issue may come from a bug on our backend. We are still looking into it.

@JasonGross, hello!
Could you please check whether you still see the same issues?

Closing this for now, but please let us know if you still see the same issue.

It's been happening less often, but it just happened again: https://github.com/JasonGross/fiat-crypto/runs/697783422
[screenshot]

I've also seen the Mac OS jobs frequently show up as "cancelled" when I didn't cancel them, and I don't believe anyone else did, either.

So I guess this issue should be re-opened.

We've had the same issue on one of our builds scheduled to run Monday through Friday at 01:30 UTC. It would be cancelled randomly after running for about 10-15 minutes.

Last night it succeeded for the first time in weeks but I will continue to monitor.
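For context, a Monday-through-Friday 01:30 UTC schedule like ours looks roughly like this (the rest of the workflow is omitted):

```yaml
on:
  schedule:
    # minute hour day-of-month month day-of-week
    - cron: '30 1 * * 1-5'   # 01:30 UTC, Monday through Friday
```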

I had this problem, so I made an example project to show GitHub support: https://github.com/joehinkle11/Mac-GitHub-Actions-Test/actions

They also responded with an email saying:

Hi Joe,

Thank you for your continued patience while we investigated these issues. For context, there is an existing issue tracking this:

#736

Due to similar reliability reports and errors when using our current MacOS platform for GitHub Actions, we have decided to make larger changes that will provide a long-term solution.

We understand that you may continue to experience reliability issues while on the current platform, and hope to provide a better experience as soon as possible. If you notice any issues with billing on the next billing cycle, please reach out.

At this time we have improvements planned for early July and will keep our customers up to date through our blogs and changelog.

Please let us know if you have any questions or concerns!

Cheers,
GitHub Support

Hope this helps anyone who is working on an Action and doesn't yet realize it's a bug with GitHub and not their scripts.

I've also had frequent random cancellations of GH Actions jobs (especially Mac OS), with missing logs, such as https://github.com/mit-plv/fiat-crypto/runs/791672004?check_suite_focus=true
[screenshot]
And here's one where the logs are present: https://github.com/mit-plv/fiat-crypto/runs/791678094?check_suite_focus=true
[screenshot]

GitHub won't even tell me who canceled these jobs, or why they were canceled. (Was it because I pushed another commit that triggered the workflow? Is GitHub now forcibly canceling jobs on old commits, even those which are on the tip of their branch but are not the newest one running across all branches?)
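For reference, the only opt-in mechanism I know of for this kind of auto-cancellation is the workflow-level `concurrency` setting; a sketch of what deliberate cancellation of superseded runs on the same branch looks like:

```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true   # cancel older in-flight runs for the same group
```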

I can also confirm that jobs on MacOS are cancelled for no apparent reason: https://github.com/cytopia/pwncat/pull/80/checks?check_run_id=792119613

Additionally, there are no logs or other info regarding why it was cancelled.

We fixed a configuration issue in the service that caused Mac hosted builds to hit this error every day between 1:00 and 2:00 AM UTC.

We fixed a configuration issue in the service that caused Mac hosted builds to hit this error every day between 1:00 and 2:00 AM UTC.

That sounds promising 🎉 Is that fix already live?

@svenmuennich the fix is already live, and I can confirm from the telemetry that the fix works as expected.

👇 we no longer have the big spike every night.
[screenshot of telemetry graph]

Great! Thank you 🥇

Will keep this issue open for a few more days. @svenmuennich, @cytopia, @JasonGross, could you please report back if you still see the same issues?

We still see the same issues. Here is a build from 8 hours ago (Wed, 24 Jun 2020 08:08:20 GMT) that failed in this way: https://github.com/mit-plv/fiat-crypto/pull/817/checks?check_run_id=802532325
[screenshot]

Attempting to fetch the raw logs gives

2020-06-24T08:08:06.3800369Z ##[section]Starting: Request a runner to run this job
2020-06-24T08:08:06.6634916Z Can't find any online and idle self-hosted runner in current repository that matches the required labels: 'macos-latest'
2020-06-24T08:08:06.6634949Z Can't find any online and idle self-hosted runner in current repository's account/organization that matches the required labels: 'macos-latest'
2020-06-24T08:08:06.6634965Z Found online and idle hosted runner in current repository's account/organization that matches the required labels: 'macos-latest'
2020-06-24T08:08:06.8916332Z ##[section]Finishing: Request a runner to run this job

which is bizarre.

@TingluoHuang, could it be something different?

https://github.com/github/c2c-actions-compute/issues/643 is a 404 for me; is there any issue I can track about this (other than this one)?

Last night our scheduled build failed again. This time we got an error though:

An error occurred while provisioning resources (Error Type: Disconnect).

No idea whether that is related to this issue.

Hello everyone!
We have recently done some changes on our side. Could you please check if you still see the same issue (steps without logs)?

Hello everyone!
We have recently done some changes on our side. Could you please check if you still see the same issue (steps without logs)?

I have the same problem on macOS: https://github.com/atlas-engine/AtlasStudio/runs/1215960239.
[Screenshot 2020-10-07 at 09:52:58]

@maxim-lobanov

@maxim-lobanov Would you reopen this bug? https://github.com/mit-plv/fiat-crypto/runs/2979458972 has the log-less red ❌'s with missing raw logs
[screenshot]

2021-07-03T15:24:52.2128487Z Can't find any online and idle self-hosted or hosted runner in the current repository, account/organization that matches the required labels: 'macos-latest'
2021-07-03T15:24:52.2128608Z Found online and busy hosted runner(s) in the current repository's organization account that matches the required labels: 'macos-latest'. Hit concurrency limits on the hosted runners. Waiting for one of them to get assigned for this job.
2021-07-03T15:24:52.2128637Z Waiting for a hosted runner in 'organization' to pick this job...

Downloading the log archive yields an archive that simply does not contain logs for any of the red ❌'s, with the same incomplete raw logs.
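For anyone trying to reproduce the missing-logs symptom programmatically, a run's log archive can also be fetched through the REST API; a sketch as a workflow step, assuming the default `GITHUB_TOKEN` can read Actions data (the output filename is arbitrary):

```yaml
      - name: Download this run's log archive (diagnostic sketch)
        if: always()
        run: |
          curl -sSL \
            -H "Authorization: Bearer ${{ secrets.GITHUB_TOKEN }}" \
            -o run-logs.zip \
            "https://api.github.com/repos/${{ github.repository }}/actions/runs/${{ github.run_id }}/logs"
```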

Hi @JasonGross! Sorry to hear that.
I've checked the telemetry and the root cause seems to be the same as here — #3517
We will notify the engineering team about these new cases.

We're also getting this. Just reporting in case more examples help isolate the root cause. Example run at https://github.com/rubygems/rubygems/runs/2987926029.

[screenshot]

I still have the issue: the step keeps running for a very long time and is finally cancelled automatically for no reason. While that step is in progress, no logs are generated, and if I try "View raw logs" it shows the following:

[screenshot]

Same issue. The "Run tests" step of our macos-latest jobs randomly hangs or takes forever in https://github.com/combinators/cls-scala
Decided to factor the tests out into a separate workflow so I can restart them more easily until this gets resolved.
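A sketch of that factored-out workflow, with a `workflow_dispatch` trigger so it can be restarted by hand (the workflow name and test command are assumptions for this project):

```yaml
name: macos-tests
on:
  push:
  workflow_dispatch:   # allows restarting the tests manually
jobs:
  test:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run tests
        run: sbt test   # assumed test command
```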

Any update on this?