Filer with leveldb3 causes data loss; leveldb2 works fine
vmihailenco opened this issue
When using multiple Filer+S3 instances with leveldb3, the whole bucket can suddenly disappear after a bunch of files are deleted. Simply replacing leveldb3 with leveldb2 makes the issue unreproducible.
There is nothing in the logs except that the master receives notifications that a number of volumes are empty/removed when the bucket disappears. Sometimes the bucket is fully gone; sometimes only a folder is missing. The volumes seem to be gone too, so it looks like the filer decides it is time to delete the bucket...
I was able to reproduce the issue by adding/removing a ClickHouse partition back and forth using S3.
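The workload essentially amounts to bulk-creating and then bulk-deleting many objects in the bucket. For readers without ClickHouse, here is a rough stand-in using plain S3 calls (the endpoint URL, bucket name, and iteration counts are my assumptions, not from the original setup):

```sh
# crude stand-in for the ClickHouse partition add/remove workload:
# repeatedly bulk-write a batch of objects, then bulk-delete them
# (endpoint, bucket, and counts are hypothetical)
for round in $(seq 1 20); do
  for i in $(seq 1 100); do
    dd if=/dev/urandom of="/tmp/part_$i" bs=1M count=1 2>/dev/null
    aws --endpoint-url http://filer:8333 s3 cp "/tmp/part_$i" "s3://testbucket/parts/part_$i"
  done
  aws --endpoint-url http://filer:8333 s3 rm --recursive "s3://testbucket/parts/"
done
```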
Servers have a good network, and `cluster.check` reports no issues.
```
version 8000GB 3.63 54d7748a4a54d94a31ce04d05db801faeff4f690 linux amd64
```

Startup commands (the `{{ ... }}` placeholders are Ansible template variables):

```sh
weed master -ip={{ weed_domain }} -ip.bind=0.0.0.0 -mdir=./master -defaultReplication=001 -volumePreallocate -disableHttp
weed filer -s3 -s3.config=/etc/seaweedfs/s3_config.json -master={{ masters | join(',') }}
weed volume -mserver={{ masters | join(',') }} -ip={{ weed_domain }} -ip.bind=0.0.0.0 -port={{ port }} -dir={{ dir }} -index=leveldb -max=0 -idleTimeout=60 -dataCenter=dc1 -rack=rack1
```
Filer config:

```toml
[filer.options]
recursive_delete = true
#max_file_name_length = 255

[leveldb3]
enabled = true
dir = "./filerldb3"
```
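For reference, the leveldb2 configuration that avoids the problem differs only in the store section; a minimal sketch (the `dir` path here is my assumption):

```toml
[leveldb2]
enabled = true
dir = "./filerldb2"
```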
Hard to tell if that is enough to reproduce the issue, but I'd also like to confirm a few things:
- Is it okay that all filer logs are full of these messages (they repeat every second)?

  ```
  I0309 11:20:57.355525 filer_grpc_server_sub_meta.go:130 read on disk filer:XXX.XXX.5.82:8888@XXX.XXX.5.82:63030 local subscribe / from 2024-03-09 10:20:37.530843927 +0000 UTC
  I0309 11:20:57.355620 filer_grpc_server_sub_meta.go:149 read in memory filer:XXX.XXX.5.82:8888@XXX.XXX.5.82:63030 local subscribe / from 2024-03-09 10:20:37.530843927 +0000 UTC
  I0309 11:20:58.414201 filer_grpc_server_sub_meta.go:296 + local listener filer:XXX.XXX.187.112:8888@XXX.XXX.187.112:13910 clientId -532674855 clientEpoch 16650
  I0309 11:20:58.414227 filer_grpc_server_sub_meta.go:117 + filer:XXX.XXX.187.112:8888@XXX.XXX.187.112:13910 local subscribe / from 2024-03-09 08:41:49.725816183 +0000 UTC clientId:-532674855
  I0309 11:20:58.414238 filer_grpc_server_sub_meta.go:112 disconnect filer:XXX.XXX.187.112:8888@XXX.XXX.187.112:45252 local subscriber / clientId:-532674855
  I0309 11:20:58.414256 filer_grpc_server_sub_meta.go:312 - local listener filer:XXX.XXX.187.112:8888@XXX.XXX.187.112:45252 clientId -532674855 clientEpoch 16647
  ```
- Is it okay to have orphan entries reported by `volume.fsck`? I have several filers, but they should all be in sync because I am not writing anything to them for a few minutes. I get orphan entries even with leveldb2, just by adding/removing a ClickHouse partition via S3... Unlike with leveldb3, it does not seem to cause any issues... (A way to run this check non-interactively is sketched after this list.)
- It looks like all the filers have all the data. Is it possible to change that, e.g. have a dedicated filer for the bucket without subscribing to other filers?
- Is there any way to check if the filers are in sync?
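For the orphan-entry question above, this is roughly how the check was run; a sketch assuming `weed shell` is pointed at your master and reads commands from stdin (the master address is a placeholder):

```sh
# run the filer/volume consistency check non-interactively
echo "volume.fsck" | weed shell -master=master1:9333
```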
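On the last question, I am not aware of a built-in sync check; one crude probe (my own sketch, with assumed endpoints and bucket name) is to list the same bucket through each filer's S3 endpoint and diff the listings:

```sh
# compare what two different filers return for the same bucket
aws --endpoint-url http://filer1:8333 s3 ls --recursive s3://testbucket/ | sort > /tmp/filer1.list
aws --endpoint-url http://filer2:8333 s3 ls --recursive s3://testbucket/ | sort > /tmp/filer2.list
diff /tmp/filer1.list /tmp/filer2.list && echo "listings match"
```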
---

I have also encountered this issue, and now I have switched to leveldb2.