buildbuddy-io / buildbuddy

BuildBuddy is an open source Bazel build event viewer, result store, remote cache, and remote build execution platform.

Home Page:https://buildbuddy.io

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Windows builds constantly failing with IOException: connection was forcibly closed by the remote host

avdv opened this issue · comments

We are using Bazel 6.2.0 currently, but the issue existed before (I tried to follow suggestions in #992 but no dice).

The issue only happens for the Windows builds, Linux and Darwin are fine.

ERROR: The Build Event Protocol upload failed: Not retrying publishBuildEvents, no more attempts left: status='Status{code=UNAVAILABLE, description=io exception, cause=java.io.IOException: An existing connection was forcibly closed by the remote host
	at java.base/sun.nio.ch.SocketDispatcher.read0(Native Method)
	at java.base/sun.nio.ch.SocketDispatcher.read(Unknown Source)
	at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
	at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
	at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
	at java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source)
	at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:258)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)
}' UNAVAILABLE: An existing connection was forcibly closed by the remote host UNAVAILABLE: An existing connection was forcibly closed by the remote host
INFO: Build completed successfully, 511 total actions
Error: Process completed with exit code 38.

CI runs: 1 (bb), 2 (bb), 3 (bb)

Note, we have 4 jobs in the workflow running on Windows, and one of them succeeded: https://github.com/tweag/rules_haskell/actions/runs/5738410548/job/15564769980?pr=1925

The successful job took a bit over one hour, the failing jobs took >5 hours. Could this be just some sort of timeout?

We are using:

--bes_upload_mode=wait_for_upload_complete
--bes_timeout=60s

Should we try something else? Or tune different parameters? Thank you for any suggestions!

@sluongng any chance you're able to reproduce this on Windows?

This kind of feels like an error that Bazel would handle gracefully / retry on Linux (this suggests that this is the windows error message for a RST stream).

I believe our servers send an RST stream when they're restarting (and the 3 errors linked are around this time of our release yesterday when all of the apps are started, which happened around noon PST).

The question for me is why isn't Bazel able to successfully retry the BES stream on Windows like it seems to be able to on Linux / Mac (otherwise we could never restart our servers / perform releases).

I have not seen this before while I was trying to build BuildBuddy on Windows 🤔

Gona reach out to Tweag folks on Slack to see if I could reproduce with rules_haskell or not... though I doubt that this is specific to a repo, there might be some flags/options specific to rules_haskell CI (or their GitHub Action Runner) that could affect this.

Seems like it happened during a release, you could try running a Windows build against a local server and restarting it mid build.

Slightly hard since our server does not build on Windows 😅
But I took that as a challenge: I set up a personal VPN on my Windows VM and ran //server on my MacBook.

Tried to stop and restart server on my MacBook 2 times mid-build but could not reproduce the issue 🤔

I tried a bunch of approaches and could not replicate the exact error.

@avdv It would be helpful if you could help capture the java log in $(bazel info output-base)/java.log*, specifically the logs from BuildEventServiceUploader would be pretty beneficial. I want to check the timing of the Retrying stream log and its reasons.

It's likely that the 5 hours build duration overlapping with our weekly release time frame was the root cause of this: our old server received the stream -> was interrupted for upgrade -> Bazel retried to another server that would also get interrupted later. If this is indeed the case, bumping --experimental_build_event_upload_max_retries=4 to a higher number would be a good solution.

Thank you!

It would be helpful if you could help capture the java log in $(bazel info output-base)/java.log*, specifically the logs from BuildEventServiceUploader would be pretty beneficial. I want to check the timing of the Retrying stream log and its reasons.

I'll try to reproduce and attach the files.

discussed with @avdv on Slack and tested it myself in https://github.com/sluongng/rules_haskell/actions/runs/5785750975/job/15678998276 https://app.buildbuddy.io/invocation/c912e3eb-c949-4872-9246-47a743b7425d

It seems that the issue is unrelated to our weekly release. I could not reproduce it on AWS EC2 Windows VM, but it seems like the Github Action's managed runner (on Azure) could be easily reproduced.

One thing I've noticed is that the Github action runners on Linux seem to have pretty flaky networking, and that setting this kernel param seems to help:

sudo sysctl -w net.ipv4.tcp_keepalive_time=60

I wonder if there is an equivalent for Windows. Based on this it seems the default is 2 hours and it can be tweaked: https://serverfault.com/a/735768

I have made a PR here where I uploaded log files as artifacts. See here.

Looking at the "windows-latest rules_haskell bzlmod" job's logs:

230808 05:23:49.822:I 1 [com.google.devtools.build.lib.runtime.BlazeRuntime.main] Starting Bazel server with pid 1876 and , args [--max_idle_secs=10800, --noshutdown_on_low_│
sys_mem, --connect_timeout_secs=30, --output_user_root=c:\_bzl, --install_base=c:\_bzl\install\b81619668f3c61a40ed47421c0adf7f4, --install_md5=b81619668f3c61a40ed47421c0adf7│
f4, --output_base=c:\_bzl\minshlu6, --workspace_directory=d:\a\rules_haskell\rules_haskell, --default_system_javabase=C:/hostedtoolcache/windows/Java_Temurin-Hotspot_jdk/8.0│
.382-5/x64, --failure_detail_out=c:\_bzl\minshlu6\failure_detail.rawproto, --expand_configs_in_place, --idle_server_tasks, --write_command_log, --nowatchfs, --nofatal_event_│
bus_exceptions, --nowindows_enable_symlinks, --client_debug=false, --host_jvm_args=-Dbazel.windows_unix_root=C:\Program Files\Git\usr\bin\bash.exe, --host_jvm_args=-Djdk.tls│
.client.protocols=TLSv1.2, --product_name=Bazel, --option_sources=host_Ujvm_Uargs:d_C\a\rules_Uhaskell\rules_Uhaskell\.bazelrc.common:output_Uuser_Uroot:d_C\a\rules_Uhaskell│
\rules_Uhaskell\.bazelrc.local]
...
230808 05:23:53.701:I 20 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.logIds] Streaming Build Event Protocol to 'grpcs://remote.buildbuddy.io' wi│
th build_request_id: '0ae44fad-579a-44e4-a296-fd383a6c5571' and invocation_id: '514cd9ca-ecd6-45e6-89f5-f3b0bea0defa'                                                        │
230808 05:24:00.465:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=2           │
230808 05:36:46.267:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
scription=io exception, cause=java.io.IOException: An existing connection was forcibly closed by the remote host                                                             │
230808 05:36:47.273:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=237         │
230808 05:41:22.864:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
scription=io exception, cause=java.io.IOException: An existing connection was forcibly closed by the remote host                                                             │
230808 05:41:24.467:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=237         │
230808 05:46:00.879:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
scription=io exception, cause=java.io.IOException: An existing connection was forcibly closed by the remote host                                                             │
230808 05:46:03.721:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=237         │
230808 05:50:40.630:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
scription=io exception, cause=java.io.IOException: An existing connection was forcibly closed by the remote host                                                             │
230808 05:50:44.732:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=237         │
230808 05:57:58.894:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Not retrying publishBuildEvents, no more attempts le│
ft: status='Status{code=UNAVAILABLE, description=io exception, cause=java.io.IOException: An existing connection was forcibly closed by the remote host                      │
230808 05:57:58.924:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] About to cancel all local file uploads              │
230808 05:57:58.925:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Spent 0 milliseconds doing local file upload cancell│
ation                                                                                                                                                                        │
230808 05:57:59.062:X 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.logAndSetException] The Build Event Protocol upload failed: Not retrying│
 publishBuildEvents, no more attempts left: status='Status{code=UNAVAILABLE, description=io exception, cause=java.io.IOException: An existing connection was forcibly closed │
by the remote host                                                                                                                                                           │
230808 06:35:42.655:I 20 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose] Spent 0 milliseconds doing waiting for │
BES close for invocation 514cd9ca-ecd6-45e6-89f5-f3b0bea0defa                                                                                                                │
230808 06:35:42.669:WT 20 [com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand] While running afterCommand() on BazelBuildEventServiceModule                     │
com.google.devtools.build.lib.util.AbruptExitException: The Build Event Protocol upload failed: Not retrying publishBuildEvents, no more attempts left: status='Status{code=U│
NAVAILABLE, description=io exception, cause=java.io.IOException: An existing connection was forcibly closed by the remote host                                               │
        at com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.logAndSetException(BuildEventServiceUploader.java:298)                                  │
        at com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.run(BuildEventServiceUploader.java:334)                                                 │
Caused by: com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader$DetailedStatusException: UNAVAILABLE: io exception                                      │
        at com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.withFailureDetail(BuildEventServiceUploader.java:731)                                   │
        at com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents(BuildEventServiceUploader.java:547)                                  │
        at com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.run(BuildEventServiceUploader.java:311)                                                 │
230808 06:35:42.669:I 20 [com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec] Exit status was DetailedExitCode{exitCode=PUBLISH_ERROR, failureDetail=message: │
"The Build Event Protocol upload failed: Not retrying publishBuildEvents, no more attempts left: status=\'Status{code=UNAVAILABLE, description=io exception, cause=java.io.IO│
Exception: An existing connection was forcibly closed by the remote host\r\n\tat java.base/sun.nio.ch.SocketDispatcher.read0(Native Method)\r\n\tat java.base/sun.nio.ch.Sock│
etDispatcher.read(Unknown Source)\r\n\tat java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)\r\n\tat java.base/sun.nio.ch.IOUtil.read(Unknown Source)\r\n\tat j│
ava.base/sun.nio.ch.IOUtil.read(Unknown Source)\r\n\tat java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source)\r\n\tat io.netty.buffer.PooledByteBuf.setBytes(PooledByte│
Buf.java:258)\r\n\tat io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)\r\n\tat io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel│
.java:350)\r\n\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)\r\n\tat io.netty.channel.nio.NioEventLoop.processSelectedK│
ey(NioEventLoop.java:722)\r\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)\r\n\tat io.netty.channel.nio.NioEventLoop.processSele│
ctedKeys(NioEventLoop.java:584)\r\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)\r\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleT│
hreadEventExecutor.java:986)\r\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\r\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(F│
astThreadLocalRunnable.java:30)\r\n\tat java.base/java.lang.Thread.run(Unknown Source)\r\n}\' UNAVAILABLE: An existing connection was forcibly closed by the remote host UNAV│
AILABLE: An existing connection was forcibly closed by the remote host"

So the build started at 05:23:49 and the first disconnection happened at 05:36:46. Then it happens after 7, 5, 4 and 7 minutes again. I'll just try to bump the retries to a high number.

Thank you @siggisim, I'll look into the TCP keepalive setting too.

I did set the KeepAliveTime registry entry to 300000 milliseconds = 5 mins, but that did not help. Yet it almost seems to happen at steady rate once every 5mins now...

230809 06:29:01.478:I 20 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.logIds] Streaming Build Event Protocol to 'grpcs://remote.buildbuddy.io' wi│
230809 06:29:21.604:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=2           │
230809 06:34:05.272:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 06:34:06.272:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=134         │
230809 06:45:19.297:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 06:45:20.902:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=318         │
230809 06:49:50.790:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 06:49:53.353:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=318         │
230809 06:54:22.370:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 06:54:26.471:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=318         │
230809 06:58:55.971:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 06:59:02.538:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=318         │
230809 07:03:31.893:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 07:03:42.384:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=318         │
230809 07:10:15.104:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 07:10:31.887:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=386         │
230809 07:19:28.817:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 07:19:55.674:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=389         │
230809 07:24:26.380:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 07:25:09.346:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=389         │
230809 07:29:39.393:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 07:30:48.115:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=389         │
230809 07:35:17.988:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 07:37:07.943:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=390         │
230809 07:41:37.300:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 07:44:33.241:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=390         │
230809 07:50:15.981:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Retrying stream: status='Status{code=UNAVAILABLE, de│
230809 07:54:57.464:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Starting publishBuildEvents: eventQueue=445         │
230809 07:54:57.620:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] BES uploader is half-closed                         │
230809 07:54:58.451:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] publishBuildEvents was successful                   │
230809 07:54:58.451:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] About to cancel all local file uploads              │
230809 07:54:58.451:I 232 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceUploader.publishBuildEvents] Spent 0 milliseconds doing local file upload cancell│
230809 07:54:58.621:I 20 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose] Spent 21048 milliseconds doing waiting │

Setting --experimental_build_event_upload_max_retries=256 did remedy the issue although I saw "The Build Event Protocol upload timed out." in one job instead.

Just to note (seems I only mentioned this in Slack), we are mostly seeing this error with PRs from forks (where the BuildBuddy API secret is not set, so we only have a read-only cache).

Also, I made a PR adding the max retries setting and it failed with an upload timeout error again.

230809 10:35:02.715:I 20 [com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose] Spent 60000 milliseconds doing waiting for BES close for invocation 0aca44ca-32b2-41f6-9760-2a7ca8956818
230809 10:35:02.720:WT 20 [com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand] While running afterCommand() on BazelBuildEventServiceModule
com.google.devtools.build.lib.util.AbruptExitException: The Build Event Protocol upload timed out. com.google.common.util.concurrent.TimeoutFuture$TimeoutFutureException: Timed out: NonCancellationPropagatingFuture@4cd9776e[status=PENDING, info=[delegate=[SettableFuture@9dbe091[status=PENDING]]]]
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.createAbruptExitException(BuildEventServiceModule.java:851)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:512)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.closeBepTransports(BuildEventServiceModule.java:608)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.afterCommand(BuildEventServiceModule.java:623)
	at com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:625)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:649)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:240)
	at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:550)
	at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$1(GrpcServerImpl.java:614)
	at io.grpc.Context$1.run(Context.java:566)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.util.concurrent.ExecutionException: com.google.common.util.concurrent.TimeoutFuture$TimeoutFutureException: Timed out: NonCancellationPropagatingFuture@4cd9776e[status=PENDING, info=[delegate=[SettableFuture@9dbe091[status=PENDING]]]]
	at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:588)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:567)
	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:113)
	at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:240)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:506)
	... 11 more

I'll try to increase the timeout and see if it makes a difference (no high hopes, I guess 60s should be plenty already).

https://www.buildbuddy.io/docs/troubleshooting-slow-upload#the-build-event-protocol-upload-timed-out

We do recommend setting the BES timeout value to 600s here. I think it should help with cases like yours, but it would be better to know where in the network stack did the failure occur 🤔

Thank you @sluongng, but seems like it did not help: https://github.com/tweag/rules_haskell/actions/runs/5834070440/job/15838104954?pr=1935

I'll retry and try to get a log file. \edit: the job ran successfully on the third try.

We have recently helped a few customers troubleshoot their AWS network connectivity. Folks who connect to BuildBuddy through a NAT Gateway would often be subjected to the gateway's TCP timeout, especially when it comes to a long-running grpc stream.

It's unknown how Github is setting up the managed Windows runners, but I guess that it's also subjected to similar limitations.
https://learn.microsoft.com/en-us/azure/nat-gateway/troubleshoot-nat-connectivity#tcp-idle-timeout-timers-set-higher-than-the-default-value

On Bazel side, there are --grpc_keepalive_time and --grpc_keepalive_timeout flags that might help with these behavior.

There is also --experimental_remote_execution_keepalive but I don't think RBE is relevant here.

Also here is how you can tune the setting on Linux's sysctl https://www.buildbuddy.io/docs/troubleshooting-rbe#warning-remote-cache-unavailable-io-exception. It seems like you can achieve a similar effect on Windows https://serverfault.com/questions/735515/tcp-timeout-for-established-connections-in-windows. However, since we are discussing Github-managed runners, I think these are less relevant.