Provide a way to allow retries in FUSE, `filer.copy`, and other operations.
eliphatfs opened this issue
Describe the bug
When there is a huge number of requests, it is inevitable that a network connection will occasionally break. Currently, `filer.copy` just prints a warning and, in some cases, the whole process ends; either way, the failed request is not retried. To migrate the storage reliably, I have to manually rerun the command after the previous run finishes, with `check.size` set. The error log is not informative enough to tell which files have failed.
With `weed mount`, FUSE simply returns an Input/output error in such cases, disrupting, for example, a machine-learning training job.
These errors happen roughly once per million requests, but they prevent SeaweedFS from being a reliable file system we can trust in our cluster.
System Setup
- Master/filer: `/usr/local/bin/weed server -volume=0 -filer -dir=/weedfs`
- Volume servers, on 8 machines other than the master: `/usr/local/bin/weed volume -max=400 -dir=/weedfs -mserver=10.8.149.13:9333`

These commands are run as systemd services, set up exactly as in the wiki page.
- OS version: Ubuntu 22.04
- output of `weed version`: version 8000GB 3.63 54d7748 linux amd64
- if using filer, show the content of `filer.toml`: no `filer.toml`
- content of `security.toml`:
[access]
ui = false
[grpc]
ca = "/etc/ariesdockerd/certs/Aries_SeaweedFS_CA.crt"
[grpc.volume]
cert = "/etc/ariesdockerd/certs/volume01.crt"
key = "/etc/ariesdockerd/certs/volume01.key"
[grpc.master]
cert = "/etc/ariesdockerd/certs/master01.crt"
key = "/etc/ariesdockerd/certs/master01.key"
[grpc.filer]
cert = "/etc/ariesdockerd/certs/filer01.crt"
key = "/etc/ariesdockerd/certs/filer01.key"
[grpc.client]
cert = "/etc/ariesdockerd/certs/client01.crt"
key = "/etc/ariesdockerd/certs/client01.key"
Expected behavior
The client should retry failed requests after a back-off delay, up to a time limit (e.g. 60 seconds), and only then declare the failure.
Screenshots
Sample errors:
I0312 06:49:42.407970 upload_content.go:95 assign volume failure count:1 path:"/************/000-114/105b2927193149ceb7dec74908264fee/": rpc error: code = Unavailable desc = error reading from server: read tcp 10.8.149.3:55350->10.8.149.13:18888: read: connection timed out
copy file error: filerGrpcAddress assign volume: rpc error: code = Unavailable desc = error reading from server: read tcp 10.8.149.3:55350->10.8.149.13:18888: read: connection timed out
W0312 06:49:49.159889 upload_content.go:170 uploading 0 to http://10.8.149.6:8080/903,081b656ab49c19b3: upload color_sample_0005_view_0001.png 104005 bytes to http://10.8.149.6:8080/903,081b656ab49c19b3: Post "http://10.8.149.6:8080/903,081b656ab49c19b3": EOF
W0312 06:49:52.099999 upload_content.go:170 uploading 2 to http://10.8.149.2:8080/902,081b90b10c607b14: upload color_sample_0000_view_0005.png 87185 bytes to http://10.8.149.2:8080/902,081b90b10c607b14: Post "http://10.8.149.2:8080/902,081b90b10c607b14": dial tcp 10.8.149.2:8080: connect: connection refused
W0311 23:21:14.855737 upload_content.go:170 uploading 0 to http://10.8.149.7:8080/318,02df5564b8ee4e5a: unmarshalled error http://10.8.149.7:8080/318,02df5564b8ee4e5a: reject because inflight upload data 268702962 > 268435456, and wait timeout