GoogleCloudPlatform / gcsfuse

A user-space file system for interacting with Google Cloud Storage

Home Page: https://cloud.google.com/storage/docs/gcs-fuse

Writing files from python intermittently fails with Permission Denied error but we do have permissions!

pshved opened this issue · comments

Describe the issue
Our workflows run on GKE and write new files to a GCS bucket using gcsfuse. Sometimes (not always), creating a new file and writing to it returns a PermissionDenied error. Our permissions must be configured correctly, since about 90% of the time these writes succeed.

Our code does just this (the open-source library Pillow does it on our behalf):

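import builtins

fp = builtins.open(filename, "w+b")  # create or truncate, binary read/write
fp.write(...)  # Pillow writes the encoded image bytes here
fp.close()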

Based on the debug logs, it seems that Python attempts to set mtime after writing the file, but something on the GCSfuse backend returns "Permission Denied". Perhaps the outcome depends on how quickly the write completes.

Here's an example of the failing log:

{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d368        connection.go:416] \u003c- CreateFile (parent 14, name \"1691611964_5487471_mask.png\", PID 110)\n","timestampSeconds":1695326482,"timestampNanos":417034015}
...
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d368        connection.go:498] -\u003e OK (inode 2050)\n","timestampSeconds":1695326482,"timestampNanos":454704938}                                                                                 
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d384        connection.go:416] \u003c- unknown (inode 2050, opcode 39)\n","timestampSeconds":1695326482,"timestampNanos":454974921}                                                                 
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d384        connection.go:500] -\u003e Error: \"function not implemented\"\n","timestampSeconds":1695326482,"timestampNanos":455098988}                                                             
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d386        connection.go:416] \u003c- WriteFile (inode 2050, PID 0, handle 2041, offset 0, 6584 bytes)\n","timestampSeconds":1695326482,"timestampNanos":498558969}        
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d388        connection.go:416] \u003c- SetInodeAttributes (inode 2050, PID 110, mtime 2023-09-21 20:01:22.497268241 +0000 UTC)\n","timestampSeconds":1695326482,"timestampNanos":498691831}
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d388        connection.go:500] -\u003e Error: \"permission denied\"\n","timestampSeconds":1695326482,"timestampNanos":507996734}                                                                    
{"name":"root","levelname":"ERROR","severity":"ERROR","message":"SetInodeAttributes: permission denied, SetMtime: UpdateObject: googleapi: Error 403: Access denied., forbidden\n","timestampSeconds":1695326482,"timestampNanos":507923961}
{"name":"root","levelname":"ERROR","severity":"ERROR","message":"fuse: *fuseops.SetInodeAttributesOp error: permission denied\n","timestampSeconds":1695326482,"timestampNanos":508007306} 
...
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d386        connection.go:498] -\u003e OK ()\n","timestampSeconds":1695326482,"timestampNanos":528484773}

And here's an example of a successful log:

{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d348        connection.go:416] \u003c- WriteFile (inode 2049, PID 0, handle 2040, offset 0, 179800 bytes)\n","timestampSeconds":1695326482,"timestampNanos":295767292}    
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d34a        connection.go:416] \u003c- SetInodeAttributes (inode 2049, PID 110, mtime 2023-09-21 20:01:22.294249624 +0000 UTC)\n","timestampSeconds":1695326482,"timestampNanos":295856402}         
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d348        connection.go:498] -\u003e OK ()\n","timestampSeconds":1695326482,"timestampNanos":320844681}                                                                                           
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d34a        connection.go:498] -\u003e OK ()\n","timestampSeconds":1695326482,"timestampNanos":320908858}    

I see the same pattern across our workflows: there is no error when the write is quick (the timestamps show WriteFile returning first and SetInodeAttributes second). When the WriteFile command takes longer, the subsequent SetInodeAttributes runs before WriteFile actually completes on the GCSfuse side.
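The ordering described above can be checked mechanically across a larger log. The following is a minimal Python sketch, assuming the JSON log format shown in the excerpts (a message field plus timestampSeconds and timestampNanos). The regular expressions and the function names load_ops and setattr_before_write_done are illustrative and not part of gcsfuse.

```python
import json
import re

# Request lines look like:  "Op 0x...  connection.go:416] <- WriteFile (inode 2050, ..."
# Response lines look like: "Op 0x...  connection.go:498] -> OK ()" or "-> Error: ..."
REQUEST_RE = re.compile(r"Op (0x[0-9a-f]+)\s+connection\.go:\d+\] <- (\w+) \(inode (\d+)")
RESPONSE_RE = re.compile(r"Op (0x[0-9a-f]+)\s+connection\.go:\d+\] -> (OK|Error)")


def load_ops(lines):
    """Collect request/response timestamps (in ns) per FUSE op ID."""
    ops = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip "..." separators and blank lines
        record = json.loads(line)
        message = record.get("message", "")
        ts = record["timestampSeconds"] * 1_000_000_000 + record["timestampNanos"]
        request = REQUEST_RE.search(message)
        response = RESPONSE_RE.search(message)
        if request:
            op_id, name, inode = request.groups()
            ops.setdefault(op_id, {}).update(name=name, inode=inode, start=ts)
        elif response:
            op_id, status = response.groups()
            ops.setdefault(op_id, {}).update(status=status, end=ts)
    return ops


def setattr_before_write_done(ops):
    """Return (setattr_op, write_op, setattr_status) tuples where a
    SetInodeAttributes op was answered before an earlier WriteFile on the
    same inode had been answered."""
    complete = {
        op_id: op for op_id, op in ops.items()
        if all(key in op for key in ("name", "start", "end"))
    }
    hits = []
    for sid, s in complete.items():
        if s["name"] != "SetInodeAttributes":
            continue
        for wid, w in complete.items():
            if (w["name"] == "WriteFile" and w["inode"] == s["inode"]
                    and w["start"] < s["start"] and s["end"] < w["end"]):
                hits.append((sid, wid, s["status"]))
    return hits
```

Fed the two excerpts above, this should flag op 0x0001d388 from the failing log and nothing from the successful one. It only detects the ordering; it says nothing about why the 403 is returned.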

We are running gcsfuse as

gcsfuse --implicit-dirs --max-conns-per-host=100 foo bar

and the logs above were obtained via

gcsfuse --implicit-dirs --max-conns-per-host=100 --foreground --debug_fuse foo bar &

Of course, networking will always incur transient errors. In this case, however, the errors seem to result from the natural intermittent slowness of a distributed system, combined with expectations on the Python / gcsfuse side about how syscalls behave that may hold on other file systems but not on GCSfuse. As a user of the Python API, I expect that a simple open and write to a file using only default attributes would succeed in the absence of network errors or partitions.

Is there any way we can solve or mitigate this issue? Thanks.
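One possible stopgap on the application side is to retry the whole write when it fails. The following is only a sketch: it assumes the failure surfaces in Python as PermissionError from open, write, or close, and that rewriting the file from scratch is safe for the workload. The helper name write_with_retry and its parameters are hypothetical, and nothing here is a fix confirmed by the gcsfuse maintainers.

```python
import time


def write_with_retry(path, data, attempts=3, delay_s=0.5):
    """Write `data` (bytes) to `path`, retrying the whole write on PermissionError.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, attempts + 1):
        try:
            # Same mode as in the report: create or truncate, binary read/write.
            with open(path, "w+b") as fp:
                fp.write(data)
            return attempt
        except PermissionError:
            if attempt == attempts:
                raise  # give up and surface the original error
            time.sleep(delay_s * attempt)  # simple linear backoff
```

Because Pillow opens the file itself, using this would mean encoding the image into an in-memory buffer first and passing the resulting bytes to the helper.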

System (please complete the following information):

  • OS: Ubuntu 22.04
  • Platform: Kubernetes
  • Version: gcsfuse version 1.1.0 (Go version go1.20.5)

SLO:
24 hrs to respond and 7 days to close the issue.

I can confirm via strace that Python doesn't issue any syscall that would set the mtime, which makes me think the SetInodeAttributes call is triggered by something in the GCSfuse implementation.

[pid 1228915] openat(AT_FDCWD, "tmp/our_file.png", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0666 <unfinished ...>
[pid 1228915] <... openat resumed>)     = 28 <0.000069>
[pid 1228915] fstat(28,  <unfinished ...>
[pid 1228915] <... fstat resumed>{st_mode=S_IFREG|0644, st_size=0, ...}) = 0 <0.000044>
[pid 1228915] ioctl(28, TCGETS <unfinished ...>
[pid 1228915] <... ioctl resumed>, 0x7fffdf633f60) = -1 ENOTTY (Inappropriate ioctl for device) <0.000044>
[pid 1228915] lseek(28, 0, SEEK_CUR <unfinished ...>
[pid 1228915] <... lseek resumed>)      = 0 <0.000026>
[pid 1228915] lseek(28, 0, SEEK_CUR <unfinished ...>
[pid 1228915] <... lseek resumed>)      = 0 <0.000028>
[pid 1228915] lseek(28, 0, SEEK_CUR <unfinished ...>
[pid 1228915] <... lseek resumed>)      = 0 <0.000027>
[pid 1228915] write(28, "\377\330\377\340\0\20JFIF\0\1\1\0\0\1\0\1\0\0\377\333\0C\0\1\1\1\1\1\1\1"..., 65510 <unfinished ...>
[pid 1228915] <... write resumed>)      = 65510 <0.000091>
[pid 1228915] write(28, "\265\317`&\325\220J\333\231\230\342E]\255N\362\4\221\354\216\35\252\374\356e\333_;\337\374a\202"..., 65532 <unfinished ...>
[pid 1228915] <... write resumed>)      = 65532 <0.000079>
[pid 1228915] write(28, "\236\304<6\227\372\235\214\372t\37\331w\32\224{\355t\365\217\345\371\276m\252\315\376\317\360\325/\211"..., 44654 <unfinished ...>
[pid 1228915] <... write resumed>)      = 44654 <0.000065>
[pid 1228915] lseek(28, 0, SEEK_CUR <unfinished ...>
[pid 1228915] <... lseek resumed>)      = 175696 <0.000031>
[pid 1228915] lseek(28, 0, SEEK_CUR <unfinished ...>
[pid 1228915] <... lseek resumed>)      = 175696 <0.000025>
[pid 1228915] close(28 <unfinished ...>
[pid 1228915] <... close resumed>)      = 0 <0.000049>

Hi @pshved ,

Thanks for reaching out to us.
Please share some details to reproduce the issue.

  1. Number of files you are writing.
  2. The size of the files.
  3. Are you trying to write concurrently on the same file?
  4. Full logs, collected with --debug_gcs --debug_fuse --debug_fs --log-file=log.txt --log-format=text
  5. If possible, can you please share your Python code?

Thanks,
Tulsi Shah

Hi Tulsi, thank you for your response. I compiled the unreleased version of gcsfuse from source (commit a082138a), and the problem disappeared. Looking at the code, I see that the sequence of operations performed when opening new files has changed.

It'll take me a few days to produce an example, but I'll try to if you're still interested or if the problem reappears.

Answering your questions,

  1. It happens when I'm writing 1 file or hundreds of files alike.
  2. The file sizes range from kilobytes to a few megabytes. In fact, we've only had this experience with small files; large files get written without issues.
  3. No, only sequentially

Thank you for letting us know about this issue, @pshved!

I am glad to hear that the issue is not occurring in the latest version. I would like to inform you that we have released gcsfuse v1.2.0. You can upgrade to this version.
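To find nodes still running an older build, the version string can be compared against 1.2.0. The following is a small sketch that assumes the string has the form quoted in the report ("gcsfuse version 1.1.0 (Go version go1.20.5)"). How the string is obtained, and the function names, are assumptions. Note that the thread does not state which release first contains commit a082138a.

```python
import re


def parse_gcsfuse_version(text):
    """Extract (major, minor, patch) from a 'gcsfuse version X.Y.Z ...' string."""
    match = re.search(r"gcsfuse version (\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def needs_upgrade(text, minimum=(1, 2, 0)):
    """True if the reported version is older than `minimum`."""
    return parse_gcsfuse_version(text) < minimum
```

Tuples compare element by element, so (1, 1, 0) sorts before (1, 2, 0) without any string handling.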

For now, we are closing this request. Please feel free to reopen the issue if you encounter the problem again.

Thank you!