seaweedfs / seaweedfs

SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding.

Imbalanced volume server workload during large concurrent upload.

eliphatfs opened this issue

Describe the bug
I just installed a new cluster and am migrating the old file storage (GlusterFS) to SeaweedFS.
To migrate, I run weed filer.copy -check.size=1 /old_storage/* http://10.8.149.13:8888/vol/ on 8 machines at the same time.
I visualized the per-node volume server QPS in Grafana.
The volume server workload is very imbalanced, and the 99th percentile response time of the busiest volume server is usually higher than acceptable.

System Setup

  • /usr/local/bin/weed server -volume=0 -filer -dir=/weedfs
  • /usr/local/bin/weed volume -max=400 -dir=/weedfs -mserver=10.8.149.13:9333 on 8 machines separate from the master
  • These commands run as systemd services, set up exactly as described on the wiki page.
  • OS version: Ubuntu 22.04
  • output of weed version: version 8000GB 3.63 54d7748 linux amd64
  • if using filer, show the content of filer.toml: no filer.toml
  • security:
[access]
ui = false

[grpc]
ca = "/etc/ariesdockerd/certs/Aries_SeaweedFS_CA.crt"

[grpc.volume]
cert = "/etc/ariesdockerd/certs/volume01.crt"
key  = "/etc/ariesdockerd/certs/volume01.key"

[grpc.master]
cert = "/etc/ariesdockerd/certs/master01.crt"
key  = "/etc/ariesdockerd/certs/master01.key"

[grpc.filer]
cert = "/etc/ariesdockerd/certs/filer01.crt"
key  = "/etc/ariesdockerd/certs/filer01.key"

[grpc.client]
cert = "/etc/ariesdockerd/certs/client01.crt"
key  = "/etc/ariesdockerd/certs/client01.key"

Expected behavior
Volume requests should be balanced across the volume servers so that the busiest one carries less load. The current behavior is clearly worse than a random load balancer.
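
For reference, here is a rough sketch of what a purely random load balancer would give across 8 servers. It only models uniform random choice among servers; it does not model how SeaweedFS actually assigns writes to volumes, and the request count and seed are arbitrary assumptions rather than measurements from this cluster. The function name simulate_random_lb is made up for illustration.

```python
import random
from collections import Counter


def simulate_random_lb(num_servers: int, num_requests: int, seed: int = 0) -> float:
    """Return the busiest server's request count divided by the mean count,
    when every request goes to a uniformly random server."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = Counter(rng.randrange(num_servers) for _ in range(num_requests))
    # Servers that received nothing still count as 0 requests.
    per_server = [counts.get(i, 0) for i in range(num_servers)]
    mean = num_requests / num_servers
    return max(per_server) / mean


# 8 volume servers as in this cluster; 100,000 requests is an assumed sample size.
ratio = simulate_random_lb(num_servers=8, num_requests=100_000)
print(f"busiest / mean under random assignment: {ratio:.3f}")
```

The same busiest-to-mean ratio can be computed from the per-node QPS shown in Grafana; a value far above what this prints would back up the "worse than random" observation.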

Screenshots
image
image

Additional context
This may be the cause of #5367. I've disabled the JWT and applied the IP list instead.

Please show the output of volume.list in weed shell.

Also, try to balance the volumes first.
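
A minimal sketch of how both steps could be scripted, assuming the weed binary is on PATH, that weed shell accepts commands on standard input, and that it can find the same security.toml for gRPC TLS. The helper names build_shell_input and run_weed_shell are hypothetical. As far as I understand the 3.x releases, volume.balance without -force only prints the planned moves, so the default here does not move any data.

```python
import subprocess

MASTER = "10.8.149.13:9333"  # master address taken from the setup in this issue


def build_shell_input(apply_balance: bool = False) -> str:
    """Build the list of weed shell commands, one per line."""
    balance = "volume.balance"
    if apply_balance:
        balance += " -force"  # actually move volumes instead of only planning
    # lock/unlock guard cluster-changing commands in weed shell.
    commands = ["volume.list", "lock", balance, "unlock"]
    return "\n".join(commands) + "\n"


def run_weed_shell(script: str, master: str = MASTER) -> str:
    """Pipe the commands into `weed shell` and return its standard output.
    Assumes `weed` is installed; raises CalledProcessError on a non-zero exit."""
    result = subprocess.run(
        ["weed", "shell", f"-master={master}"],
        input=script,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling run_weed_shell(build_shell_input()) on a machine that can reach the master would print the volume layout followed by the balance plan; passing apply_balance=True is the step that actually relocates volumes.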

volumes.log
Rebalancing takes quite a lot of resources on the volume servers. Given that they are already slow on some requests, it seems unwise to run a rebalance while the large upload is still in progress.