GoogleCloudPlatform / gcsfuse

A user-space file system for interacting with Google Cloud Storage

Home Page: https://cloud.google.com/storage/docs/gcs-fuse

GCS FUSE stalls at Garbage Collection

robert-ulbrich-mercedes-benz opened this issue · comments

Describe the issue
We are using the GCS FUSE CSI driver in a GKE setup to handle a huge amount of data: we are reading several million files, totaling more than 8 TB, during ML training. This works well for the first hour or so. We always see the following log output:

Starting a garbage collection run.

followed by

Garbage collection succeeded after deleted 0 objects in 10.43209ms.

That indicates that the garbage collection completed successfully. It runs every 10 minutes and usually does not delete any stale objects.

Then, after around an hour of smooth operation, we suddenly see only that a garbage collection run was started, but never that it finished. From the moment we see that log message, our whole workload freezes and no data is delivered to our process anymore.

It is hard to tell whether the freeze occurs because of the garbage collection, or whether the garbage collection never finishing is only a symptom of an underlying problem, because the whole GCS FUSE process has frozen.

The sidecar container in which GCS FUSE runs can take as much CPU and memory as required: no limits are set, and the node definitely has enough memory available.

We can reproduce the issue on every training.

To Collect more Debug logs
Steps to reproduce the behavior:

  1. Please make sure there are no other security, monitoring, or background processes running that could interfere with the FUSE process. If possible, reproduce under a fresh/clean installation.
  2. Please rerun with --debug_fuse --debug_fs --debug_gcs --debug_http --foreground as additional flags to enable debug logs (see the example invocation after this list).
  3. Monitor the logs and capture screenshots or copy the relevant logs to a file (you can use --log-format and --log-file as well).
  4. Attach the screenshot or the logs file to the bug report here.
  5. If you're using gcsfuse with any other library/tool/process please list out the steps you took to reproduce the issue.
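
For reference, a mount invocation covering the debug flags from steps 2 and 3 might look roughly like this (bucket name, mount point, and log path are placeholders):

  # Hypothetical invocation: all debug log categories enabled, running in the
  # foreground, with logs written to a file.
  gcsfuse --debug_fuse --debug_fs --debug_gcs --debug_http --foreground \
    --log-file=/tmp/gcsfuse.log --log-format=text \
    my-bucket /path/to/mountpoint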

We have run the workload with debug logs enabled and can provide the logs if required. As far as we can tell, there are no error indications in the logs; there is no error before the workload simply stops working.

System (please complete the following information):

  • Platform: GKE
  • Version: 2.0.0

Additional context
Add any other context about the problem here.

SLO:
We strive to respond to all bug reports within 24 business hours.

Could you try with a higher --max-conns-per-host value? (The default is 100; you can try setting it to 1024.)

As per the troubleshooting documentation, this does seem similar to one of the known scenarios:
Cloud Storage FUSE gets stuck when using it to concurrently work with a large number of opened files

This happens when gcsfuse is mounted with the HTTP/1 client (the default) and the application using gcsfuse tries to keep more files open than the value of --max-conns-per-host. You can try (a) passing a value higher than the number of files you want to keep open to the --max-conns-per-host flag, or (b) adding a timeout for HTTP client connections using the --http-client-timeout flag.
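
For illustration only (bucket name, mount point, and the specific values are placeholders), a direct gcsfuse mount applying both suggestions could look like this:

  # (a) allow more connections per host than the number of files kept open;
  # (b) close and retry HTTP connections that stay stuck for too long.
  gcsfuse \
    --max-conns-per-host=1024 \
    --http-client-timeout=30s \
    my-bucket /path/to/mountpoint

With the GCS FUSE CSI driver, the equivalent settings are typically passed through the volume's mountOptions attribute rather than on a command line.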

Hi @charith87,

thanks a lot for your quick answer. We will try a higher value for --max-conns-per-host right away. In the meantime, we were able to confirm that the garbage collection is not the cause of the problem, but just a symptom.

Yeah, it's more like a symptom. GCSFuse does not get the response to the List call (which is issued as part of the garbage collection job), which strongly suggests an issue with the GCS connection.

Hi @charith87,

we tried again yesterday with --max-conns-per-host set to 10,000 for the GCS FUSE CSI driver. The setting was picked up, but unfortunately the run failed again with the same problem: it simply stopped working, again without any error message.

Any idea what else we can do?

Best regards

Rob Ulbrich

We are investigating this on our side and will get back to you.

In the meantime, we were able to solve our problem by providing the correct configuration.

Summarizing the results of the investigation and resolution in case others experience similar issues:

TLDR

  • The application freeze was due to a workload opening more than 10k file handles simultaneously, exceeding the default limit of 100 that GCSfuse sets via --max-conns-per-host. We will increase this default in an upcoming release.
  • If your workload keeps fewer than 1 million files open at a time, consider using --max-conns-per-host=0
  • If you are unsure how many files your workload may keep open at a time, consider using both --max-conns-per-host=0 and --http-client-timeout=Xs
  • For multiple random reads of the same file, enable the gcsfuse file cache with file-cache:cache-file-for-range-read

Root cause

  • The root cause was a workload opening more than 10k file handles simultaneously, exceeding the default limit of 100 that GCSfuse sets via --max-conns-per-host.
  • Switching the client protocol GCSfuse uses to communicate with GCS from HTTP/1 (the default) to HTTP/2 via the --client-protocol flag also resolved the issue, since the HTTP/2 client is not subject to this connection limit. However, this came with the performance drop associated with using HTTP/2 instead of the default HTTP/1.

Solution

  • Setting --max-conns-per-host to 0. This effectively removes the gcsfuse-side limit on the number of open network connections; the number of connections is then limited only by the maximum number of open file handles the OS allows, in this case 1 million.
  • Using the --http-client-timeout flag with a value of 5 seconds. This closes a network connection after 5s, which breaks the hold on stuck connections and resolves the freeze regardless of the value assigned to --max-conns-per-host. The right value (here 5s) depends on the size of the I/O operations: if the timeout is too small for an operation to complete, an infinite retry loop of closing and reopening connections occurs and the application freezes. Internal testing showed that 5 seconds is adequate for small random reads on files of about 5 MiB (assuming compute/storage regional co-location); workloads that do larger I/O operations need a higher value. See the sketch below.
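
As a rough sketch of the settings described above (bucket name, mount point, config path, and cache directory are placeholders; the 5s timeout assumes small random reads as noted):

  # No gcsfuse-side connection limit; the OS open-file limit applies instead.
  # Connections stuck for more than 5 seconds are closed and retried.
  gcsfuse \
    --max-conns-per-host=0 \
    --http-client-timeout=5s \
    --config-file=/etc/gcsfuse/config.yaml \
    my-bucket /path/to/mountpoint

  # /etc/gcsfuse/config.yaml: enables the file cache for repeated random reads
  # of the same file (file-cache:cache-file-for-range-read from the TLDR above).
  cache-dir: /var/cache/gcsfuse
  file-cache:
    cache-file-for-range-read: true

As noted under Root cause, --client-protocol=http2 is an alternative that avoids the per-host connection limit entirely, at some performance cost.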

@marcoa6 I observed a different effect: it was repeatedly freezing without specifying --max-conns-per-host, and it works with --max-conns-per-host=100.

This seems strange, as the default value of max-conns-per-host is already 100 (ref: the flag definition sets Value: 100).

Could you please create a different ticket with more details to reproduce this issue, if possible?

I don't know whether I should open it here or not, because I am using it with the Google CSI driver.
But there they are also using the new default:
GoogleCloudPlatform/gcs-fuse-csi-driver@4ae8725

I am running into this same issue in Cloud Run. I have a job that stalls immediately when run, with the same log output of Starting a garbage collection run. followed by Garbage collection succeeded after deleted 0 objects in 10.43209ms. repeated every 10 minutes.

When I ran this exact same job just one month ago, I did not encounter this issue.

I have tried a few cases to try to get to the bottom of this.

  1. A job in which thousands of files are read in. This results in the garbage collection stalling as described above. I suspect this is due to the default command flag --max-conns-per-host=100 mentioned above.
  2. A job in which a 3 GB file is read and fewer than 100 files are read in. This also results in the garbage collection stalling. I suspect this stalling is due to the default config flag --max-file-size-mb=100.
  3. A job in which >100 files are read in, each >512 MB. This job runs successfully. This strengthens my hypotheses for the other two jobs.

What further complicates this is that Cloud Run does not support gcsfuse flags. When I configure the volume mount portion of the YAML file for the job as follows:

...
  volumes:
  - csi:
      driver: gcsfuse.run.googleapis.com
      volumeAttributes:
        bucketName: target-selection-pipeline
        mountOptions: "max-conns-per-host=0"
    name: volume

I am met with the following error:

Job failed to deploy
ERROR: (gcloud.beta.run.jobs.replace) INVALID_ARGUMENT: spec.execution_template.spec.task_spec.volumes[0]: gcsfuse flags are not supported for Cloud Storage volumes.
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: gcsfuse flags are not supported for Cloud Storage volumes.
    field: spec.execution_template.spec.task_spec.volumes[0]

Is there a solution to this problem for Cloud Run, or will I have to deploy this via GKE or Compute Engine? Again, this is a new issue that did not occur when I ran the exact same job one month ago, on April 6, 2024.

@burkelawlor please open a support request for the Cloud Run team, as passing mount options is currently available only through allowlist access.

Case #2 seems odd, as --max-file-size-mb is related to log files, not to the data files being read.