Provide a way to allow retries in FUSE, `filer.copy`, and other operations.
eliphatfs opened this issue
Describe the bug
When there is a huge number of requests, it is inevitable that a network connection will occasionally break. Currently, `filer.copy` just prints a warning and, in some cases, the whole process ends; either way, the failed request is not retried. To migrate the storage reliably, I have to manually rerun the command after the previous run finishes, with `check.size` set. The error log is not informative enough to tell which files have failed.
With `weed mount`, FUSE simply returns an Input/output error in such cases, disrupting, for example, a machine-learning training job.
These errors happen roughly once per million requests, but they prevent SeaweedFS from being a reliable file system we can trust in our cluster.
System Setup
- Master/filer: `/usr/local/bin/weed server -volume=0 -filer -dir=/weedfs`
- Volume servers, on 8 machines other than the master: `/usr/local/bin/weed volume -max=400 -dir=/weedfs -mserver=10.8.149.13:9333`

These commands are run as systemd services, set up exactly as in the wiki page.
- OS version: Ubuntu 22.04
- output of `weed version`: version 8000GB 3.63 54d7748 linux amd64
- if using filer, show the content of `filer.toml`: no `filer.toml`
- content of `security.toml`:
[access]
ui = false
[grpc]
ca = "/etc/ariesdockerd/certs/Aries_SeaweedFS_CA.crt"
[grpc.volume]
cert = "/etc/ariesdockerd/certs/volume01.crt"
key = "/etc/ariesdockerd/certs/volume01.key"
[grpc.master]
cert = "/etc/ariesdockerd/certs/master01.crt"
key = "/etc/ariesdockerd/certs/master01.key"
[grpc.filer]
cert = "/etc/ariesdockerd/certs/filer01.crt"
key = "/etc/ariesdockerd/certs/filer01.key"
[grpc.client]
cert = "/etc/ariesdockerd/certs/client01.crt"
key = "/etc/ariesdockerd/certs/client01.key"
Expected behavior
The client should retry failed requests after a back-off delay, up to a time limit (e.g. 60 seconds), and only then declare the failure.
Screenshots
Sample errors:
I0312 06:49:42.407970 upload_content.go:95 assign volume failure count:1 path:"/************/000-114/105b2927193149ceb7dec74908264fee/": rpc error: code = Unavailable desc = error reading from server: read tcp 10.8.149.3:55350->10.8.149.13:18888: read: connection timed out
copy file error: filerGrpcAddress assign volume: rpc error: code = Unavailable desc = error reading from server: read tcp 10.8.149.3:55350->10.8.149.13:18888: read: connection timed out
W0312 06:49:49.159889 upload_content.go:170 uploading 0 to http://10.8.149.6:8080/903,081b656ab49c19b3: upload color_sample_0005_view_0001.png 104005 bytes to http://10.8.149.6:8080/903,081b656ab49c19b3: Post "http://10.8.149.6:8080/903,081b656ab49c19b3": EOF
W0312 06:49:52.099999 upload_content.go:170 uploading 2 to http://10.8.149.2:8080/902,081b90b10c607b14: upload color_sample_0000_view_0005.png 87185 bytes to http://10.8.149.2:8080/902,081b90b10c607b14: Post "http://10.8.149.2:8080/902,081b90b10c607b14": dial tcp 10.8.149.2:8080: connect: connection refused
W0311 23:21:14.855737 upload_content.go:170 uploading 0 to http://10.8.149.7:8080/318,02df5564b8ee4e5a: unmarshalled error http://10.8.149.7:8080/318,02df5564b8ee4e5a: reject because inflight upload data 268702962 > 268435456, and wait timeout