seaweedfs / seaweedfs

SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding.

Provide a way to allow retries in FUSE, `filer.copy`, and other operations.

eliphatfs opened this issue

Describe the bug
When there are a huge number of requests, a network connection will inevitably break at some point.
Currently, filer.copy only prints a warning, and the whole process may or may not terminate.
In either case, it does not retry the failed operation.
To migrate the storage reliably, I have to wait for the command to finish and then manually rerun it with check.size set.
The error log is not informative enough to tell which files failed.
weed mount (FUSE) simply returns an Input/output error in such cases, which disrupts, for example, a machine learning training task.
These errors happen about once in a million, but they prevent SeaweedFS from being a reliable file system we can trust in our cluster.

System Setup

  • /usr/local/bin/weed server -volume=0 -filer -dir=/weedfs
  • /usr/local/bin/weed volume -max=400 -dir=/weedfs -mserver=10.8.149.13:9333 on 8 machines different from the master
  • These commands are run as systemd services, set up exactly as described on the wiki page.
  • OS version: Ubuntu 22.04
  • output of weed version: version 8000GB 3.63 54d7748 linux amd64
  • if using filer, show the content of filer.toml: no filer.toml
  • security:
[access]
ui = false

[grpc]
ca = "/etc/ariesdockerd/certs/Aries_SeaweedFS_CA.crt"

[grpc.volume]
cert = "/etc/ariesdockerd/certs/volume01.crt"
key  = "/etc/ariesdockerd/certs/volume01.key"

[grpc.master]
cert = "/etc/ariesdockerd/certs/master01.crt"
key  = "/etc/ariesdockerd/certs/master01.key"

[grpc.filer]
cert = "/etc/ariesdockerd/certs/filer01.crt"
key  = "/etc/ariesdockerd/certs/filer01.key"

[grpc.client]
cert = "/etc/ariesdockerd/certs/client01.crt"
key  = "/etc/ariesdockerd/certs/client01.key"

Expected behavior
The client retries failed requests after a back-off delay, up to a limit (e.g. 60 seconds), and only then reports the failure.

Screenshots
Sample errors:

I0312 06:49:42.407970 upload_content.go:95 assign volume failure count:1  path:"/************/000-114/105b2927193149ceb7dec74908264fee/": rpc error: code = Unavailable desc = error reading from server: read tcp 10.8.149.3:55350->10.8.149.13:18888: read: connection timed out
    copy file error: filerGrpcAddress assign volume: rpc error: code = Unavailable desc = error reading from server: read tcp 10.8.149.3:55350->10.8.149.13:18888: read: connection timed out
W0312 06:49:49.159889 upload_content.go:170 uploading 0 to http://10.8.149.6:8080/903,081b656ab49c19b3: upload color_sample_0005_view_0001.png 104005 bytes to http://10.8.149.6:8080/903,081b656ab49c19b3: Post "http://10.8.149.6:8080/903,081b656ab49c19b3": EOF
W0312 06:49:52.099999 upload_content.go:170 uploading 2 to http://10.8.149.2:8080/902,081b90b10c607b14: upload color_sample_0000_view_0005.png 87185 bytes to http://10.8.149.2:8080/902,081b90b10c607b14: Post "http://10.8.149.2:8080/902,081b90b10c607b14": dial tcp 10.8.149.2:8080: connect: connection refused
W0311 23:21:14.855737 upload_content.go:170 uploading 0 to http://10.8.149.7:8080/318,02df5564b8ee4e5a: unmarshalled error http://10.8.149.7:8080/318,02df5564b8ee4e5a: reject because inflight upload data 268702962 > 268435456, and wait timeout