Writing files from python intermittently fails with Permission Denied error but we do have permissions!
pshved opened this issue · comments
Describe the issue
Our workflows run on GKE and write new files to GCS bucket using gcsfuse. Sometimes (not always), creating a new file and writing to it returns a PermissionDenied error. Our permissions are configured correctly because 90% of the time these writes succeed.
In our code, we are just doing this (an open-source library, Pillow, is doing this on our behalf):
fp = builtins.open(filename, "w+b")
fp.write(...)
fp.close()
Based on the debug logs, it seems that Python attempts to set mtime after writing the file. But something on the GCSfuse backend returns "Permission Denied". Perhaps, the outcome differs based on how quickly the write completes.
Here's an example of the failing log:
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d368 connection.go:416] \u003c- CreateFile (parent 14, name \"1691611964_5487471_mask.png\", PID 110)\n","timestampSeconds":1695326482,"timestampNanos":417034015}
...
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d368 connection.go:498] -\u003e OK (inode 2050)\n","timestampSeconds":1695326482,"timestampNanos":454704938}
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d384 connection.go:416] \u003c- unknown (inode 2050, opcode 39)\n","timestampSeconds":1695326482,"timestampNanos":454974921}
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d384 connection.go:500] -\u003e Error: \"function not implemented\"\n","timestampSeconds":1695326482,"timestampNanos":455098988}
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d386 connection.go:416] \u003c- WriteFile (inode 2050, PID 0, handle 2041, offset 0, 6584 bytes)\n","timestampSeconds":1695326482,"timestampNanos":498558969}
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d388 connection.go:416] \u003c- SetInodeAttributes (inode 2050, PID 110, mtime 2023-09-21 20:01:22.497268241 +0000 UTC)\n","timestampSeconds":1695326482,"timestampNanos":498691831}
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d388 connection.go:500] -\u003e Error: \"permission denied\"\n","timestampSeconds":1695326482,"timestampNanos":507996734}
{"name":"root","levelname":"ERROR","severity":"ERROR","message":"SetInodeAttributes: permission denied, SetMtime: UpdateObject: googleapi: Error 403: Access denied., forbidden\n","timestampSeconds":1695326482,"timestampNanos":507923961}
{"name":"root","levelname":"ERROR","severity":"ERROR","message":"fuse: *fuseops.SetInodeAttributesOp error: permission denied\n","timestampSeconds":1695326482,"timestampNanos":508007306}
...
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d386 connection.go:498] -\u003e OK ()\n","timestampSeconds":1695326482,"timestampNanos":528484773}
And here's an example of a successful log:
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d348 connection.go:416] \u003c- WriteFile (inode 2049, PID 0, handle 2040, offset 0, 179800 bytes)\n","timestampSeconds":1695326482,"timestampNanos":295767292}
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d34a connection.go:416] \u003c- SetInodeAttributes (inode 2049, PID 110, mtime 2023-09-21 20:01:22.294249624 +0000 UTC)\n","timestampSeconds":1695326482,"timestampNanos":295856402}
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d348 connection.go:498] -\u003e OK ()\n","timestampSeconds":1695326482,"timestampNanos":320844681}
{"name":"root","levelname":"DEBUG","severity":"DEBUG","message":"fuse_debug: Op 0x0001d34a connection.go:498] -\u003e OK ()\n","timestampSeconds":1695326482,"timestampNanos":320908858}
I see the same pattern across our workflows: there is no error when the write is quick (the timestamps are ordered so that WriteFile returns first and SetInodeAttributes second). When the WriteFile command takes longer, the following SetInodeAttributes runs before WriteFile actually completes on the GCSfuse side.
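To compare the ordering across many runs, the debug log can be paired up by op id. This is a minimal sketch that assumes the JSON log layout and the "fuse_debug: Op ..." message format shown in the excerpts above; `pair_ops` is a hypothetical helper name, not part of gcsfuse.

```python
import json
import re

# Matches the "fuse_debug" message format seen in the excerpts above, e.g.
#   fuse_debug: Op 0x0001d386 connection.go:416] <- WriteFile (inode 2050, ...)
#   fuse_debug: Op 0x0001d386 connection.go:498] -> OK ()
OP_RE = re.compile(r"fuse_debug: Op (0x[0-9a-f]+)\s+\S+\] (<-|->) (.*)", re.S)


def pair_ops(lines):
    """Pair each FUSE request with its response, keyed by op id."""
    ops = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip elided ("...") or non-JSON lines
        rec = json.loads(line)
        m = OP_RE.match(rec.get("message", ""))
        if not m:
            continue  # e.g. the ERROR records, which carry no op id
        op_id, direction, rest = m.groups()
        ts = rec["timestampSeconds"] * 10**9 + rec["timestampNanos"]
        entry = ops.setdefault(op_id, {})
        if direction == "<-":
            # Request line: the op name is the first word after the arrow.
            entry["name"] = rest.split(" ", 1)[0]
            entry["start_ns"] = ts
        else:
            # Response line: keep the result text ("OK ()" or "Error: ...").
            entry["result"] = rest.strip()
            entry["end_ns"] = ts
    return ops
```

Iterating over `pair_ops(open("log.txt"))` (the file name is an assumption) and sorting the entries by `start_ns` and `end_ns` shows whether the SetInodeAttributes response arrived before or after the WriteFile response for the same inode.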
We are running gcsfuse as gcsfuse --implicit-dirs --max-conns-per-host=100 foo bar
and the logs above are obtained via gcsfuse --implicit-dirs --max-conns-per-host=100 --foreground --debug_fuse foo bar &
Of course, networking will always incur transient errors. In this case, however, the errors seem to be a result of a natural intermittent slowness of a distributed system combined with expectation on the python / gcsfuse side on how the syscalls would behave that might be different on GCSfuse than on other systems. As a user of Python API, I expect that doing a simple open and write to a file using only default attributes would succeed in absence of network errors or partitions.
Any ways we can solve / mitigate this issue? Thanks
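As a stopgap on our side, one option is to retry the write. The following is only a sketch: it assumes the failure surfaces in Python as a PermissionError from open, write, or close, and `write_with_retry` with its parameters is a hypothetical helper, not something Pillow or gcsfuse provides.

```python
import time


def write_with_retry(path, data, attempts=3, delay_s=0.5):
    """Write bytes to path, retrying when a PermissionError is raised.

    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, attempts + 1):
        try:
            # "w+b" truncates the file, so each retry rewrites it from scratch.
            with open(path, "w+b") as fp:
                fp.write(data)
            return attempt
        except PermissionError:
            if attempt == attempts:
                raise  # give up and surface the original error
            time.sleep(delay_s * attempt)  # simple linear backoff
```

Since Pillow opens the file on our behalf, the same loop would have to wrap the Pillow save call rather than a direct open. It hides the symptom without addressing the ordering of operations inside GCSfuse.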
System (please complete the following information):
- OS: Ubuntu 22.04
- Platform: Kubernetes
- Version: gcsfuse version 1.1.0 (Go version go1.20.5)
SLO:
24 hrs to respond and 7 days to close the issue.
I can confirm via strace that Python doesn't run any syscall that would set the mtime, which makes me think the mtime update is triggered by something in the GCSfuse implementation.
[pid 1228915] openat(AT_FDCWD, "tmp/our_file.png", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0666 <unfinished ...>
[pid 1228915] <... openat resumed>) = 28 <0.000069>
[pid 1228915] fstat(28, <unfinished ...>
[pid 1228915] <... fstat resumed>{st_mode=S_IFREG|0644, st_size=0, ...}) = 0 <0.000044>
[pid 1228915] ioctl(28, TCGETS <unfinished ...>
[pid 1228915] <... ioctl resumed>, 0x7fffdf633f60) = -1 ENOTTY (Inappropriate ioctl for device) <0.000044>
[pid 1228915] lseek(28, 0, SEEK_CUR <unfinished ...>
[pid 1228915] <... lseek resumed>) = 0 <0.000026>
[pid 1228915] lseek(28, 0, SEEK_CUR <unfinished ...>
[pid 1228915] <... lseek resumed>) = 0 <0.000028>
[pid 1228915] lseek(28, 0, SEEK_CUR <unfinished ...>
[pid 1228915] <... lseek resumed>) = 0 <0.000027>
[pid 1228915] write(28, "\377\330\377\340\0\20JFIF\0\1\1\0\0\1\0\1\0\0\377\333\0C\0\1\1\1\1\1\1\1"..., 65510 <unfinished ...>
[pid 1228915] <... write resumed>) = 65510 <0.000091>
[pid 1228915] write(28, "\265\317`&\325\220J\333\231\230\342E]\255N\362\4\221\354\216\35\252\374\356e\333_;\337\374a\202"..., 65532 <unfinished ...>
[pid 1228915] <... write resumed>) = 65532 <0.000079>
[pid 1228915] write(28, "\236\304<6\227\372\235\214\372t\37\331w\32\224{\355t\365\217\345\371\276m\252\315\376\317\360\325/\211"..., 44654 <unfinished ...>
[pid 1228915] <... write resumed>) = 44654 <0.000065>
[pid 1228915] lseek(28, 0, SEEK_CUR <unfinished ...>
[pid 1228915] <... lseek resumed>) = 175696 <0.000031>
[pid 1228915] lseek(28, 0, SEEK_CUR <unfinished ...>
[pid 1228915] <... lseek resumed>) = 175696 <0.000025>
[pid 1228915] close(28 <unfinished ...>
[pid 1228915] <... close resumed>) = 0 <0.000049>
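The same check can be run over a full strace capture instead of reading it by eye. This is a rough sketch: `mtime_calls` is a hypothetical helper, and the syscall list is my assumption of the timestamp-setting calls on x86-64 Linux (futimens in libc goes through utimensat).

```python
import re

# Syscalls that can change a file's timestamps on Linux (assumed list).
MTIME_SYSCALLS = ("utime", "utimes", "futimesat", "utimensat")

# Matches the start of a call, with or without strace's "[pid N]" prefix.
# "<... name resumed>" continuation lines do not match and are skipped.
CALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(")


def mtime_calls(strace_lines):
    """Return the strace lines that invoke a timestamp-setting syscall."""
    hits = []
    for line in strace_lines:
        m = CALL_RE.match(line.strip())
        if m and m.group(1) in MTIME_SYSCALLS:
            hits.append(line)
    return hits
```

An empty result for the trace above is consistent with the observation that Python itself never asks for an mtime change.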
Hi @pshved ,
Thanks for reaching out to us.
Please share some details to reproduce the issue.
- Number of files you are writing.
- The size of file.
- Are you trying to write concurrently on the same file?
- Full logs, collected with --debug_gcs --debug_fuse --debug_fs --log-file=log.txt --log-format=text
- If possible, can you please share your Python code?
Thanks,
Tulsi Shah
Hi Tulsi, thank you for your response. I compiled the unreleased version of gcsfuse from source (commit a082138a), and the problem disappeared. Looking at the code, I see that the sequence of operations for opening new files has changed.
It'll take me a few days to produce an example, but I'll try to if you're still interested or if the problem reappears.
Answering your questions,
- It happens whether I'm writing one file or hundreds of files.
- The file sizes range from kilobytes to a few megabytes. In fact, we've only had this experience with small files; large files get written without issues.
- No, only sequentially
Thank you for letting us know about this issue, @pshved!
I am glad to hear that the issue is not occurring in the latest version. I would like to inform you that we have released gcsfuse v1.2.0. You can upgrade to this version.
For now, we are closing this request. Please feel free to reopen the issue if you encounter the problem again.
Thank you!