GoogleCloudPlatform / gcsfuse

A user-space file system for interacting with Google Cloud Storage

Home Page: https://cloud.google.com/storage/docs/gcs-fuse


GCSFuse process crashing without logging anything

davidkrikava opened this issue · comments

System:
Google VM instance (E2 machine type)
CentOS Linux release 7.9.2009
GCSFuse version: 1.4.0 (issue also occurred on 1.3.0)
Bucket in use: regional (mounted from two VMs from different zones within that region)

mount options:
bucketname /mount/path _netdev,noatime,rw,stat_cache_capacity=81920,max_retry_sleep=0,stat_cache_ttl=5m0s,type_cache_ttl=30m0s,client_protocol=http2,max_conns_per_host=200,allow_other,file_mode=0775,dir_mode=0775,uid=XXX,gid=XXX,log_file=/var/log/gcsfuse.log,debug_fuse,debug_gcs,debug_mutex

VM resources:
Usage during crashes was nowhere near exhausting the VM's resources:
RAM: ~30% used
CPU: ~10% used

Sizing, req rates:
bucket size ~5.5 TB
object count ~1.3M
request rate during crashes ~100/s
(figures taken from Google Metrics Explorer for the bucket)

Issue description:
We have two VMs (01 and 02) that have the regional bucket mounted via gcsfuse. In that bucket we have two folders, 01 and 02 (each with a subtree of folders/files); the machines write into their respective folders but read from both.
Both machines also clean up their respective folders every morning, and we think this cleanup is what's causing the crash, as the time of the crashes correlates with the high request count.

Cleanup command (yes, this is not optimal and we should use a bucket lifecycle rule instead, but we switched to a cron/find script a long time ago for some reason):
find /mountpoint/01/ -mtime +20 -type f | while read path; do rm "$path"; done
First occurrence: Dec 25, 2023 (version 1.3.0; it continued happening with version 1.4.0 as well).
Just the find part is enough to cause a gcsfuse crash.
What actually happens: the gcsfuse process crashes and is no longer running; it can be fixed by unmounting and then mounting again.

Logs:
As I posted under "mount options", we are using several debug options, yet when the crash occurs nothing is logged.
The log file ends with the last successful transaction.
Same with syslog: nothing is logged there except the first failed access:
Jan 10 10:56:37 server-01 collectd[1440]: statvfs(/XXX) failed: Transport endpoint is not connected
Nothing can be found in dmesg either.

To sum up:
I am not aware of reaching any quotas (maybe there are some I am not aware of? Though in that case I would expect errors to be logged rather than the process crashing). This process has been running without changes (and, until that date, without issues) for the past 7 months.
What can we do to try to figure out the root cause of these crashes, given that nothing is logged anywhere?
Thank you for your help

Thanks @davidkrikava for reaching out and for the detailed description!

I am looking into this; in the meantime I need one clarification. Do you see this issue every time the cleanup (-mtime +20) runs? In other words, is it 100% reproducible?

Unfortunately it is not. So far we have had crashes on
Dec 25, Dec 27, Jan 5, Jan 9, and Jan 10, all during the higher request rate caused by the find command (although the command runs every day).
A second crash on Jan 10 happened when I tried to run the find command manually.
[screenshot: request-rate graph for the bucket over the last 7 days]
Here one can see the difference between when the command runs and normal traffic (last 7 days).
Thank you for the swift response.

Hi @davidkrikava, thanks for the patience!

GCSFuse doesn't log when killed externally, but in most cases it does log stack traces for crashes within the program.

GCSFuse logs the reason for the crash and a stack trace if it is mounted in foreground mode (use --foreground for this). So, could you please try mounting GCSFuse in foreground mode?

When it is mounted in background mode (the default), it logs crash info except for crashes inside jacobsa/fuse (one of gcsfuse's core dependencies).

In the meantime, I tried reproducing this but haven't succeeded yet. If you could provide minimal steps to reproduce this issue, that would help us.

Also, you could consider cleaning up the objects using a storage lifecycle delete rule instead and see whether the issue still occurs with that.

Regards,
Prince Kumar.
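For reference, a lifecycle delete rule matching the cron job's 20-day threshold can also be set programmatically. The sketch below assumes the cloud.google.com/go/storage client library and a placeholder bucket name; the same rule can be configured from the Cloud Console or with gsutil lifecycle set.

package main

import (
	"context"
	"log"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()

	// Placeholder bucket name (assumption); replace with the real bucket.
	const bucketName = "bucket-name"

	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatalf("storage.NewClient: %v", err)
	}
	defer client.Close()

	// Delete objects older than 20 days, mirroring the cron job's -mtime +20.
	_, err = client.Bucket(bucketName).Update(ctx, storage.BucketAttrsToUpdate{
		Lifecycle: &storage.Lifecycle{
			Rules: []storage.LifecycleRule{{
				Action:    storage.LifecycleAction{Type: storage.DeleteAction},
				Condition: storage.LifecycleCondition{AgeInDays: 20},
			}},
		},
	})
	if err != nil {
		log.Fatalf("Bucket.Update: %v", err)
	}
}

Note that this rule applies to the whole bucket; recent versions of the lifecycle API also support prefix-matching conditions if the deletion should be limited to the 01/ and 02/ folders.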

Hi,
Well, unfortunately I cannot provide 100% reliable steps to reproduce it, but so far I was able to bring it down once myself like this:
(also, please consider the sizing from the original post)
Mount a regional bucket from two zones (two VMs).
Run the find command from both VMs simultaneously (with some deletes).
"Hope" for a crash.

At this point I remounted with --foreground as you suggested and am trying to bring it down myself to see whether anything gets logged.
Running gcsfuse like this: gcsfuse --foreground --stat-cache-ttl 5m0s --type-cache-ttl 30m0s -o allow_other --dir-mode 0775 --uid XXX --debug_mutex -o rw -o noatime --gid XXX --debug_fuse --debug_gcs --max-retry-sleep 0 --max-conns-per-host 200 --log-file /var/log/gcsfuse.log --stat-cache-capacity 81920 --client-protocol http2 --file-mode 0775 bucket-name /our/path &
Thanks

Hi,
today I finally got lucky, managed to crash gcsfuse, and captured the output with the whole stack trace.
I ran the command as I described:

gcsfuse --foreground --stat-cache-ttl 5m0s --type-cache-ttl 30m0s -o allow_other --dir-mode 0775 --uid XXX --debug_mutex -o rw -o noatime --gid XXX --debug_fuse --debug_gcs --max-retry-sleep 0 --max-conns-per-host 200 --log-file /var/log/gcsfuse.log --stat-cache-capacity 81920 --client-protocol http2 --file-mode 0775 bucket-name /our/path

I left it running in a screen session so that if the process crashed I would get the output on the command line, and then I ran those cleanup commands (finds with delete).
The output I got was:

fatal error: concurrent map iteration and map write

The whole trace I captured:
gcsfuse_crash_stacktrace.txt
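For context, this fatal error is thrown by the Go runtime itself and cannot be caught with recover(), which is consistent with the process dying without writing a normal error log. Below is a minimal standalone illustration (not gcsfuse code, just an assumed toy program) that reproduces the same class of failure:

package main

// Not gcsfuse code: iterating over a plain Go map in one goroutine while
// another goroutine writes to it makes the runtime abort the whole process
// with "fatal error: concurrent map iteration and map write".
func main() {
	m := make(map[int]int)
	for i := 0; i < 1000; i++ {
		m[i] = i
	}

	// Concurrent writer.
	go func() {
		for i := 0; ; i++ {
			m[i%1000] = i
		}
	}()

	// Concurrent iteration; the runtime detects the unsynchronized access
	// and terminates the process immediately.
	for {
		sum := 0
		for _, v := range m {
			sum += v
		}
		_ = sum
	}
}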

@davidkrikava Thank you for sharing the crash logs. That was helpful, and we were able to reproduce the issue on our side. We are working on next steps/a fix and will keep you updated.

@davidkrikava The issue mostly happens when listing and creation of new files are done concurrently on the same mount. While we work on a fix, we suggest you consider doing the following to avoid it:

  1. Check whether listing and creation of new files (or other operations on newly created files) happen concurrently, and avoid that if possible.
  2. For example, the command mentioned in #1603 (comment) can be split into two parts to avoid the concurrent list-and-delete case:
find /mountpoint/01/ -mtime +20 -type f > files.txt 
cat files.txt | while read path; do rm "$path"; done

Thanks for your patience and please let us know if the above works for you.

Hi @davidkrikava,
just wanted to follow up - were you able to use the mitigation we suggested in #1603 (comment)?

Hello,
sorry for the late response, I have been on vacation. We will generally move towards using Google's native bucket lifecycle policy, as that should be the correct path anyway.
I would just like to clarify what you have written, to make sure I understand it correctly.

The issue mostly happens when listing and creation of new files are done concurrently on the same mount.

By this I understand that I would trigger the crash if, from the same mount (in my case the same machine), I create a file and list that same file at the same time (by "same time" I presume while the file is being created/uploaded?).
But in my case I am listing files that have already been created and deleting them; that should not cause issues.

Although, as I am listing through the bucket, there are files being created by other processes, and while going through all the files during this cleanup I may list a file that is being created.
If this is the case, I do believe that what you have suggested would not make things better, as the listing (the find within the cleanup) is the cause of the issues, not the remove. And the lifecycle will still be the better overall solution.

Given all that, I still believe this is a valid issue that should be corrected, since it can happen to anyone even without running find over the whole bucket :)
Thanks

@davidkrikava Thank you for the detailed response!

By this I understand that I would trigger the crash if, from the same mount (in my case the same machine), I create a file and list that same file at the same time

It doesn't need to be the same file. "Listing in the mount" and "creation of files in the mount" happening concurrently can cause this issue.

Although, as I am listing through the bucket, there are files being created by other processes, and while going through all the files during this cleanup I may list a file that is being created.

Got it, I think this is the real cause, i.e. listing and creation being performed concurrently in the same mount. While we work on a permanent fix, we suggest you avoid this scenario at your end if possible.

If this is the case, I do believe that what you have suggested would not make things better, as the listing (the find within the cleanup) is the cause of the issues, not the remove.

This makes sense. We were not aware that listing and create operations were being performed concurrently at your end (the original issue was only about find /mountpoint/01/ -mtime +20 -type f | while read path; do rm "$path"; done), so we thought listing + deletion might be the cause and wanted to avoid that, hence the suggestion.

Given all that, I still believe this is a valid issue that should be corrected, since it can happen to anyone even without running find over the whole bucket :)

We are working towards the fix and will update this issue once it is fixed.

Thanks again!
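For readers following along, races of this kind are usually eliminated by guarding the shared map with a lock (or switching to sync.Map). The sketch below only illustrates that general pattern with hypothetical names (statCache, entries); it is not the actual gcsfuse change.

package main

import "sync"

// Illustrative only (hypothetical names): a cache keyed by object name,
// shared between the code path that lists a directory and the code path
// that records newly created files. Taking the RWMutex around every
// iteration and every write prevents the
// "concurrent map iteration and map write" fatal error.
type statCache struct {
	mu      sync.RWMutex
	entries map[string]int64 // object name -> size
}

func (c *statCache) put(name string, size int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.entries == nil {
		c.entries = make(map[string]int64)
	}
	c.entries[name] = size
}

func (c *statCache) list() []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	names := make([]string, 0, len(c.entries))
	for name := range c.entries {
		names = append(names, name)
	}
	return names
}

func main() {
	var c statCache
	done := make(chan struct{})

	// Writer goroutine: simulates files being created...
	go func() {
		for i := 0; i < 100000; i++ {
			c.put("object", int64(i))
		}
		close(done)
	}()

	// ...while the main goroutine lists concurrently; with the lock this is safe.
	for {
		select {
		case <-done:
			return
		default:
			_ = c.list()
		}
	}
}

An RWMutex fits this access pattern because many concurrent listings can share the read lock while file creations take the exclusive write lock.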

Yes, I also thought that the delete was the issue, but it seems that the excessive listing by the find, combined with regular usage (we have an nginx root serving files from there, creating and listing), is the issue.

We are moving towards the bucket lifecycle.

Yes, then we will definitely update once the fix is released, as IMO the listing greatly increased the chances of a crash, but since we do create files there and read them, this can theoretically happen at any time now.

I now consider my participation in this case finished, but do contact me if you need anything else in the future.
Thank you