seaweedfs / seaweedfs

SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding.

Filer with leveldb3 causes data loss; leveldb2 works fine

vmihailenco opened this issue

When using multiple Filer+S3 with leveldb3, the whole bucket can suddenly disappear after a bunch of files are deleted. Simply replacing it with leveldb2 makes the issue unreproducible.

There is nothing in the logs except that the master receives notifications that a bunch of volumes are empty/removed when the bucket disappears. Sometimes the bucket is fully gone; sometimes only a folder is missing. The volume seems to be gone too, so it looks like the filer decides it is time to delete the bucket.

I was able to reproduce the issue by adding/removing a ClickHouse partition back and forth using S3.
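For a ClickHouse-free way to exercise the same pattern, a generic churn sketch against the filer's S3 endpoint may help: repeatedly upload and then delete a batch of keys, mimicking a partition being attached and dropped. The endpoint, bucket name, and credentials below are placeholders, not values from this setup.

```python
# Hedged churn sketch: put then delete batches of keys over S3, roughly
# mimicking ClickHouse adding and dropping a partition. All names here
# (bucket, key layout, env vars) are hypothetical.
import os

def churn_ops(parts, files_per_part=3):
    """Build the put/delete sequence for one attach/drop cycle per partition."""
    ops = []
    for p in range(parts):
        keys = [f"data/part-{p}/file-{i}.bin" for i in range(files_per_part)]
        ops += [("put", k) for k in keys]
        ops += [("delete", k) for k in keys]
    return ops

def run(ops, bucket):
    import boto3  # optional dependency, only needed for a live run
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ["S3_ENDPOINT"],  # e.g. http://filer:8333
        aws_access_key_id=os.environ["S3_KEY"],
        aws_secret_access_key=os.environ["S3_SECRET"],
    )
    for op, key in ops:
        if op == "put":
            s3.put_object(Bucket=bucket, Key=key, Body=b"x" * 1024)
        else:
            s3.delete_object(Bucket=bucket, Key=key)

# Live run only when the (hypothetical) env vars are set.
if __name__ == "__main__" and "S3_ENDPOINT" in os.environ:
    run(churn_ops(parts=10), bucket="test-bucket")
```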

Servers have a good network and cluster.check reports no issues.

version 8000GB 3.63 54d7748a4a54d94a31ce04d05db801faeff4f690 linux amd64

weed master -ip={{ weed_domain }} -ip.bind=0.0.0.0 -mdir=./master -defaultReplication=001 -volumePreallocate -disableHttp

weed filer -s3 -s3.config=/etc/seaweedfs/s3_config.json -master={{ masters | join(',') }}

weed volume -mserver={{ masters | join(',') }} -ip={{ weed_domain }} -ip.bind=0.0.0.0 -port={{ port }} -dir={{ dir }} -index=leveldb -max=0 -idleTimeout=60 -dataCenter=dc1 -rack=rack1

Filer config:

[filer.options]
recursive_delete = true
#max_file_name_length = 255

[leveldb3]
enabled = true
dir = "./filerldb3"
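For reference, the fallback that made the issue unreproducible amounts to this filer.toml fragment. Note the separate directory: leveldb2 and leveldb3 use different on-disk layouts, so (as far as I understand, worth verifying) the leveldb2 store should not be pointed at the existing leveldb3 directory; the metadata needs to be repopulated.

```toml
# Hedged workaround fragment: fall back to the leveldb2 store.
[leveldb3]
enabled = false

[leveldb2]
enabled = true
dir = "./filerldb2"
```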

Hard to tell if that is enough to reproduce the issue, but I'd also like to confirm a few things:

  1. Is it okay that all filer logs are full of these messages (repeating every second):
I0309 11:20:57.355525 filer_grpc_server_sub_meta.go:130 read on disk filer:XXX.XXX.5.82:8888@XXX.XXX.5.82:63030 local subscribe / from 2024-03-09 10:20:37.530843927 +0000 UTC
I0309 11:20:57.355620 filer_grpc_server_sub_meta.go:149 read in memory filer:XXX.XXX.5.82:8888@XXX.XXX.5.82:63030 local subscribe / from 2024-03-09 10:20:37.530843927 +0000 UTC
I0309 11:20:58.414201 filer_grpc_server_sub_meta.go:296 + local listener filer:XXX.XXX.187.112:8888@XXX.XXX.187.112:13910 clientId -532674855 clientEpoch 16650
I0309 11:20:58.414227 filer_grpc_server_sub_meta.go:117  + filer:XXX.XXX.187.112:8888@XXX.XXX.187.112:13910 local subscribe / from 2024-03-09 08:41:49.725816183 +0000 UTC clientId:-532674855
I0309 11:20:58.414238 filer_grpc_server_sub_meta.go:112 disconnect filer:XXX.XXX.187.112:8888@XXX.XXX.187.112:45252 local subscriber / clientId:-532674855
I0309 11:20:58.414256 filer_grpc_server_sub_meta.go:312 - local listener filer:XXX.XXX.187.112:8888@XXX.XXX.187.112:45252 clientId -532674855 clientEpoch 16647
  2. Is it okay to have orphan entries reported by volume.fsck? I have several filers, but they should all be in sync because I have not written anything to them for a few minutes.

    I get orphan entries even with leveldb2 just by adding/removing a ClickHouse partition via S3. Unlike with leveldb3, though, it does not seem to cause any issues.

  3. It looks like all the filers have all the data. Is it possible to change that? E.g. a dedicated filer for the bucket that does not subscribe to the other filers.
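On the question of a dedicated filer per bucket: newer SeaweedFS versions have a notion of filer groups, where only filers in the same group replicate metadata to each other. The `-filerGroup` flag below is an assumption to verify against `weed filer -h` for your version.

```shell
# Hedged sketch (verify -filerGroup in your version's `weed filer -h`):
# filers in different groups keep independent metadata and do not
# subscribe to each other's updates.
weed filer -filerGroup=clickhouse -s3 -s3.config=/etc/seaweedfs/s3_config.json -master={{ masters | join(',') }}
weed filer -filerGroup=general -master={{ masters | join(',') }}
```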

  4. Is there any way to check if filers are in sync?
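One crude way to check whether filers are in sync is to list the same directory through each filer's HTTP port and diff the results. The sketch below assumes the filer's HTTP listing API returns JSON with an `Entries` array of `FullPath` values when `Accept: application/json` is sent; the hosts and path are hypothetical.

```python
# Hedged consistency check: fetch one directory listing from each filer
# and report entries that any filer is missing. The JSON shape
# ("Entries"/"FullPath") is an assumption about the filer HTTP API.
import json
import os
import urllib.request

def list_names(filer, path):
    """Fetch one directory listing from a filer and return the entry names."""
    req = urllib.request.Request(
        f"http://{filer}{path}?limit=10000",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return {e["FullPath"] for e in data.get("Entries") or []}

def diff_listings(listings):
    """Given {filer: set_of_names}, return the names each filer is missing."""
    union = set().union(*listings.values())
    return {filer: sorted(union - names) for filer, names in listings.items()}

# Live run only when FILERS is set, e.g. FILERS=host1:8888,host2:8888
if __name__ == "__main__" and "FILERS" in os.environ:
    filers = os.environ["FILERS"].split(",")
    listings = {f: list_names(f, "/buckets/mybucket/") for f in filers}
    for filer, missing in diff_listings(listings).items():
        print(filer, "missing:", missing or "none")
```

An empty "missing" list for every filer only shows the listed directory agrees; it is not a full metadata comparison.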

I have also encountered this issue, and now I have switched to leveldb2.