Imbalanced volume server workload during large concurrent upload.

Question

eliphatfs opened this issue 3 months ago

Describe the bug
I just installed a new cluster and am migrating our old file storage (GlusterFS) to SeaweedFS.
To migrate, I run weed filer.copy -check.size=1 /old_storage/* http://10.8.149.13:8888/vol/ on 8 machines at the same time.
I graphed the per-node volume server QPS in Grafana.
The volume server workload is very imbalanced, and the 99th-percentile response time of the busiest volume server is usually higher than acceptable.

System Setup

/usr/local/bin/weed server -volume=0 -filer -dir=/weedfs
/usr/local/bin/weed volume -max=400 -dir=/weedfs -mserver=10.8.149.13:9333 (run on 8 machines separate from the master)
These commands run as systemd services, set up exactly as described on the wiki page.
OS version: Ubuntu 22.04
output of weed version: version 8000GB 3.63 54d7748 linux amd64
if using filer, show the content of filer.toml: no filer.toml
security:

[access]
ui = false

[grpc]
ca = "/etc/ariesdockerd/certs/Aries_SeaweedFS_CA.crt"

[grpc.volume]
cert = "/etc/ariesdockerd/certs/volume01.crt"
key  = "/etc/ariesdockerd/certs/volume01.key"

[grpc.master]
cert = "/etc/ariesdockerd/certs/master01.crt"
key  = "/etc/ariesdockerd/certs/master01.key"

[grpc.filer]
cert = "/etc/ariesdockerd/certs/filer01.crt"
key  = "/etc/ariesdockerd/certs/filer01.key"

[grpc.client]
cert = "/etc/ariesdockerd/certs/client01.crt"
key  = "/etc/ariesdockerd/certs/client01.key"

Expected behavior
Volume requests should be balanced across the volume servers so that the busiest one is not overloaded. The current behavior is clearly worse than random load balancing.
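
As a rough baseline for what "random LB" would look like, here is a minimal sketch that assigns requests uniformly at random to 8 volume servers and reports the ratio of the busiest server's request count to the mean. The function name, request count, and seed are illustrative assumptions, not measurements from this cluster, and the model ignores request size and volume placement.

```python
import random
from collections import Counter


def random_lb_imbalance(num_servers: int, num_requests: int, seed: int = 0) -> float:
    """Return max/mean per-server request count under uniform random assignment.

    Hypothetical helper for illustration only; it is not part of SeaweedFS.
    """
    rng = random.Random(seed)
    # Pick a server uniformly at random for every request.
    counts = Counter(rng.randrange(num_servers) for _ in range(num_requests))
    per_server = [counts.get(s, 0) for s in range(num_servers)]
    mean = num_requests / num_servers
    return max(per_server) / mean


if __name__ == "__main__":
    ratio = random_lb_imbalance(num_servers=8, num_requests=100_000)
    print(f"max/mean under random assignment: {ratio:.3f}")
```

With this many requests the ratio stays close to 1, so a busiest-node QPS that is several times the per-node mean would indeed be worse than random assignment.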

Additional context
This may be the cause of #5367. I've disabled JWT and applied the IP list instead.

Chris Lu · Answer 1 · Tue Mar 12 2024 23:10:13 GMT+0800 (China Standard Time)

Please show the output of volume.list in weed shell.

Also, try to balance the volumes first.

Ruoxi · Answer 2 · Wed Mar 13 2024 01:14:54 GMT+0800 (China Standard Time)

volumes.log
Rebalancing takes a lot of resources on the volume servers, and since they are already slow on some requests, it seems unwise to run a rebalance while the large upload is still running.