buildbuddy-io / buildbuddy

BuildBuddy is an open source Bazel build event viewer, result store, remote cache, and remote build execution platform.

Home Page: https://buildbuddy.io

Goroutines leak?

kusaeva opened this issue

Hello!
We are using BuildBuddy on-prem with bazel-remote as the remote cache.
It looks like goroutines leak when BuildBuddy gets errors from the remote cache.

What we see
We see the number of goroutines increase in Grafana at the same time as a large number of errors appear in the logs,
pointing to a problem connecting to the remote cache.

   stderr   2023/10/11 15:21:27.277 ERR rpc error: code = Unavailable desc = Failed to stream to blobstore for path 78ed3b49-f03c-4399-8353-b292ec17001c/artifacts/cache/blobs/99df8b918b0881ac02d088af19640bdbc990ca7fdebb1374ba919e33a4326532/40442 to persist cache artifact at bytestream://bazel-remote-grpc.vrts-slb.company.net:80/blobs/99df8b918b0881ac02d088af19640bdbc990ca7fdebb1374ba919e33a4326532/40442: failed to read byte stream resource "bytestream://bazel-remote-grpc.vrts-slb.company.net:80/blobs/99df8b918b0881ac02d088af19640bdbc990ca7fdebb1374ba919e33a4326532/40442" invocation_id=78ed3b49-f03c-4399-8353-b292ec17001c   
2023-10-11 18:21:27.278	
   stderr   2023/10/11 15:21:27.278 WRN Error byte-streaming from "bytestream://bazel-remote-grpc.vrts-slb.company.net:80/blobs/f30cc3c243a9048a5c475a5864d43bf2cb02301bdccbba9e1bea65ec6f33fda7/40443": rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [2a02:6b8:0:3400::4d5]:80: connect: connection refused"   
2023-10-11 18:21:27.279	
   stderr   2023/10/11 15:21:27.278 ERR rpc error: code = Unavailable desc = Failed to stream to blobstore for path 78ed3b49-f03c-4399-8353-b292ec17001c/artifacts/cache/blobs/f30cc3c243a9048a5c475a5864d43bf2cb02301bdccbba9e1bea65ec6f33fda7/40443 to persist cache artifact at bytestream://bazel-remote-grpc.vrts-slb.company.net:80/blobs/f30cc3c243a9048a5c475a5864d43bf2cb02301bdccbba9e1bea65ec6f33fda7/40443: failed to read byte stream resource "bytestream://bazel-remote-grpc.vrts-slb.company.net:80/blobs/f30cc3c243a9048a5c475a5864d43bf2cb02301bdccbba9e1bea65ec6f33fda7/40443" invocation_id=78ed3b49-f03c-4399-8353-b292ec17001c   
2023-10-11 18:21:27.279	
   stderr   2023/10/11 15:21:27.278 WRN Error byte-streaming from "bytestream://bazel-remote-grpc.vrts-slb.company.net:80/blobs/9324d9efded049d8e3869f6fe80658468041b60a815358e34bc415789b49b9e7/3924": rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [2a02:6b8:0:3400::4d5]:80: connect: connection refused"   
2023-10-11 18:21:27.279

Sometimes there is more than one burst; in that case all available memory is usually exhausted and the application crashes (stops responding to health checks).
What concerns me is that the goroutine count never returns to its previous level, so I suspect a leak.

Could you look into it, please?

Additional context
Buildbuddy v2.25.0 (but also reproducing with v2.21.1)

Hey @kusaeva,

Can you please grab a profile from http://localhost:9090/debug/pprof/goroutine?debug=1 (replace localhost with your server) and share it so we can see what these goroutines are doing?
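For anyone reading such a profile for the first time, here is a minimal Go sketch that fetches the debug=1 goroutine profile and prints how many goroutines share each stack. It assumes the standard text layout written by Go's pprof handler (a "goroutine profile: total N" header, then "COUNT @ addresses" lines followed by "#"-prefixed frames). The names summarize and stackGroup are made up for this example and are not part of BuildBuddy.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"os"
	"sort"
	"strconv"
	"strings"
)

// stackGroup (hypothetical name) is one entry of a debug=1 goroutine profile:
// the number of goroutines sharing a stack and the first frame of that stack.
type stackGroup struct {
	Count    int
	TopFrame string
}

// summarize parses the text served by /debug/pprof/goroutine?debug=1 and
// returns the reported total plus the stack groups, largest first.
func summarize(profile string) (total int, groups []stackGroup) {
	sc := bufio.NewScanner(strings.NewReader(profile))
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow long address lines
	for sc.Scan() {
		line := sc.Text()
		switch {
		case strings.HasPrefix(line, "goroutine profile: total "):
			total, _ = strconv.Atoi(strings.TrimPrefix(line, "goroutine profile: total "))
		case strings.HasPrefix(line, "#\t"):
			// Frame line: "#", pc, "func+offset", "file:line".
			if n := len(groups); n > 0 && groups[n-1].TopFrame == "" {
				if f := strings.Fields(line); len(f) >= 3 {
					groups[n-1].TopFrame = f[2]
				}
			}
		default:
			// Group header: "351 @ 0x43e0f6 0x44e2b0 ...".
			if head, _, ok := strings.Cut(line, " @"); ok {
				if n, err := strconv.Atoi(head); err == nil {
					groups = append(groups, stackGroup{Count: n})
				}
			}
		}
	}
	sort.SliceStable(groups, func(i, j int) bool { return groups[i].Count > groups[j].Count })
	return total, groups
}

func main() {
	// Default URL is the one from this thread; pass another as the first argument.
	url := "http://localhost:9090/debug/pprof/goroutine?debug=1"
	if len(os.Args) > 1 {
		url = os.Args[1]
	}
	resp, err := http.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		return
	}
	total, groups := summarize(string(body))
	fmt.Printf("total goroutines: %d\n", total)
	for i, g := range groups {
		if i == 5 {
			break
		}
		fmt.Printf("%6d  %s\n", g.Count, g.TopFrame)
	}
}
```

The first frame is often a runtime function such as a park or wait call, so the full stack in the raw profile is still worth reading once the largest groups are known.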

Unrelated to the leak issue: for simplicity and performance, I would probably use the cache functionality included with BuildBuddy rather than a separate cache. It gives you better tooling (you can see all cache requests), and the performance is on par with other solutions.

Cheers.

Hey @tylerwilliams,
thanks for the quick response!
Here is a profile from an instance with an enormous number of goroutines:
goroutine.txt

It seems like 351 of the 454 goroutines are busy with AWS S3 blobstore uploads.
How are you configuring S3 storage in your deployment? Could you please share the YAML?

I am also assuming you are NOT using AWS S3 itself but some alternative. We need to check whether that alternative supports the APIs used in aws-sdk-go-v2.

Hey, here is our yaml:
config.yaml.txt
You are right, we do not use AWS S3; we use our own alternative that is partly compatible with the AWS S3 API.

So the theory is that it works well up to a certain point and breaks down when trying to upload something via an unsupported method? Could you give me some hints about which API methods we should check first?

I suspect the BuildBuddy server isn't able to contact your cache based on this error:

transport: Error while dialing: dial tcp [2a02:6b8:0:3400::4d5]:80: connect: connection refused

It tries to make gRPC requests to your cache in order to persist timing profiles and test logs for longer-term storage.

You can disable this feature by setting the config option storage.disable_persist_cache_artifacts to true, which disables copying test logs and timing profiles into S3 (blobstore).
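As a config fragment, that option would presumably look like the following. This is a sketch that assumes the dotted option name maps onto nested YAML keys in the server's config.yaml; the rest of the file is left out.

```yaml
# Assumed nesting for storage.disable_persist_cache_artifacts
storage:
  disable_persist_cache_artifacts: true
```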

The alternative would be to figure out why the connection is getting refused from the BuildBuddy app to your cache.

Separately, it seems like (as you mentioned) there is a leak where something doesn't get properly cleaned up when we hit this error case: https://github.com/buildbuddy-io/buildbuddy/blob/master/server/build_event_protocol/build_event_handler/build_event_handler.go#L504
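To illustrate the general kind of bug being described (this is not BuildBuddy's actual code, and not necessarily what the linked line does), here is a small Go sketch of a producer goroutine feeding a consumer that can fail partway through. Without the context cancellation, the producer would stay blocked on its channel send forever after the consumer returns an error. All names here (produce, copyChunks, goroutineGrowth, errUnavailable) are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"runtime"
	"time"
)

// errUnavailable stands in for an error such as "connection refused".
var errUnavailable = errors.New("cache unavailable")

// produce sends n chunks on the returned channel from its own goroutine.
// It stops early when ctx is canceled, so it cannot block forever on a
// send that nobody will receive.
func produce(ctx context.Context, n int) <-chan int {
	ch := make(chan int)
	go func() {
		defer close(ch)
		for i := 0; i < n; i++ {
			select {
			case ch <- i:
			case <-ctx.Done():
				return
			}
		}
	}()
	return ch
}

// copyChunks consumes chunks and fails when it reaches chunk failAt.
// The deferred cancel runs on every return path, including the error
// path, which is what releases the producer goroutine.
func copyChunks(ctx context.Context, n, failAt int) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	for c := range produce(ctx, n) {
		if c == failAt {
			return errUnavailable
		}
	}
	return nil
}

// goroutineGrowth runs several failing copies and reports how many more
// goroutines exist afterwards than before, after giving producers up to
// two seconds to observe the cancellation.
func goroutineGrowth(runs int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < runs; i++ {
		_ = copyChunks(context.Background(), 10, 3)
	}
	deadline := time.Now().Add(2 * time.Second)
	for runtime.NumGoroutine() > before && time.Now().Before(deadline) {
		time.Sleep(10 * time.Millisecond)
	}
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("goroutine growth after 100 failed copies:", goroutineGrowth(100))
}
```

Removing the ctx.Done() case from produce turns this into the leaking variant: each failed copy then leaves one goroutine parked on the send, which matches the symptom of a goroutine count that climbs with every burst of errors and never comes back down.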

Thank you for the workaround, but unfortunately logs from older invocations are too important for us, so we would rather not disable the feature.

I will try to investigate why the remote cache sometimes refuses connections (maybe it's overwhelmed by requests from invocations with too many actions? Maybe we can configure the concurrency from BuildBuddy at this place?).

But it would be great if such temporary unavailability of the remote cache did not kill BuildBuddy :)
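For reference, bounding concurrency in Go is commonly done with a buffered channel used as a semaphore. The sketch below shows only that general technique; the thread does not say whether BuildBuddy exposes such a setting, and runLimited and sleepTasks are names made up for this example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runLimited runs every task in its own goroutine but allows at most
// limit of them to be in flight at once. It returns the peak number of
// tasks that were observed running concurrently.
func runLimited(limit int, tasks []func()) int {
	sem := make(chan struct{}, limit) // buffered channel used as a semaphore
	var wg sync.WaitGroup
	var inFlight, peak int64
	for _, t := range tasks {
		sem <- struct{}{} // blocks while limit tasks are already running
		wg.Add(1)
		go func(task func()) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when the task ends
			n := atomic.AddInt64(&inFlight, 1)
			for {
				// Record a new peak if this is the highest count seen so far.
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			task()
			atomic.AddInt64(&inFlight, -1)
		}(t)
	}
	wg.Wait()
	return int(peak)
}

// sleepTasks builds n placeholder tasks that each sleep for ms milliseconds,
// standing in for requests to a remote cache.
func sleepTasks(n, ms int) []func() {
	tasks := make([]func(), n)
	for i := range tasks {
		tasks[i] = func() { time.Sleep(time.Duration(ms) * time.Millisecond) }
	}
	return tasks
}

func main() {
	peak := runLimited(4, sleepTasks(20, 5))
	fmt.Println("peak concurrent tasks:", peak)
}
```

Because a slot is acquired before each goroutine starts, a burst of work queues up in the caller instead of opening an unbounded number of connections to the cache.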

We think this change should fix the leak issue: #5008

It will go live in next week's release. Please re-open the issue if you're still encountering issues after upgrading.