Imbalanced volume server workload during large concurrent upload.
eliphatfs opened this issue
Describe the bug
I just installed a new cluster and am migrating the old file storage (GlusterFS) to SeaweedFS.
I run weed filer.copy -check.size=1 /old_storage/* http://10.8.149.13:8888/vol/
on 8 machines to do the migration.
I visualized the volume QPS per node in Grafana.
I see that the volume server workload is very imbalanced, and the 99th-percentile response time of the busiest volume server is usually unacceptably high.
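To put a number on the imbalance, one option is the ratio of the busiest node's QPS to the mean QPS across nodes. The following is a minimal sketch, assuming the per-node QPS values are read off the Grafana panel by hand. imbalance_ratio is a hypothetical helper, not part of weed, and the sample numbers are made up rather than taken from this cluster.

```shell
# Hypothetical helper: busiest node's QPS divided by the mean QPS.
# 1.00 means perfectly even load; larger values mean more skew.
imbalance_ratio() {
  printf '%s\n' "$@" | awk '
    # Accumulate the total and track the largest reading.
    { sum += $1; if (NR == 1 || $1 > max) max = $1 }
    END {
      # Guard against an empty or all-zero input.
      if (NR == 0 || sum == 0) { print "nan"; exit }
      printf "%.2f\n", max / (sum / NR)
    }'
}

# Made-up per-node QPS readings, one argument per volume server.
imbalance_ratio 120 80 100 100   # prints 1.20
```

A ratio close to 1.00 would suggest the load is spread evenly, and a ratio of several times that would match the skew described here.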
System Setup
/usr/local/bin/weed server -volume=0 -filer -dir=/weedfs
/usr/local/bin/weed volume -max=400 -dir=/weedfs -mserver=10.8.149.13:9333
on 8 machines separate from the master.
- These commands are run as systemd services, set up exactly as described on the wiki page.
- OS version: Ubuntu 22.04
- output of weed version: version 8000GB 3.63 54d7748 linux amd64
- if using filer, show the content of filer.toml: no filer.toml
- security:
[access]
ui = false
[grpc]
ca = "/etc/ariesdockerd/certs/Aries_SeaweedFS_CA.crt"
[grpc.volume]
cert = "/etc/ariesdockerd/certs/volume01.crt"
key = "/etc/ariesdockerd/certs/volume01.key"
[grpc.master]
cert = "/etc/ariesdockerd/certs/master01.crt"
key = "/etc/ariesdockerd/certs/master01.key"
[grpc.filer]
cert = "/etc/ariesdockerd/certs/filer01.crt"
key = "/etc/ariesdockerd/certs/filer01.key"
[grpc.client]
cert = "/etc/ariesdockerd/certs/client01.crt"
key = "/etc/ariesdockerd/certs/client01.key"
Expected behavior
Requests should be balanced across volume servers so that the busiest one is under less load. The current behavior is clearly worse than random load balancing.
Additional context
May be the cause of #5367. I've disabled the JWT and applied the IP list instead.
Please show the output of volume.list in weed shell.
Also, try to balance the volumes first.
volumes.log
Rebalancing takes quite a lot of resources on the volume servers and, given that they are already slow on some requests, it seems unwise to run a rebalance while the large upload is still running.