How to determine if consistency check has truly fixed the data?

Question

libo-sober opened this issue 9 months ago · comments

Imagine a common scenario:
A disk on a machine has failed, and we need to replace the disk and rebuild the data for the processes that were running on it.

I have some questions about this:

If the cluster holds a large amount of data, how long does it take for the consistency check to discover that the data maintained by these processes is lost and for a new replica to be added?
If the repair takes too long and replicas in other data centers also fail one after another, does that mean data loss?
Can running multiple consistency check processes speed up checking the cluster's data status? Or can consistency checks be triggered manually for keys within a certain range?
Assuming replicas have been added successfully, how can we determine whether a given key-value pair has 3 replicas in the cluster, and where they are located? (three_data_hall mode)

Thanks!

Jingyu Zhou · Answer 1 · Fri Sep 08 2023 08:36:35 GMT+0800 (China Standard Time)

Can you repost your questions on the forum? Most of them are already answered in old posts there. The GitHub issue tracker is intended for describing concrete issues or features.

To save you time, here are short answers to your questions:

The consistency check takes up to one full checking cycle (configurable; the default is maybe 2 weeks) to detect corrupted data.
If you replace a disk, I think the storage server process using it will crash. The data distributor will then detect the problem within seconds and start the repair, which can usually finish within 30 minutes (a rough number, depending on your exact cluster settings).
The consistency check doesn't support multiple cooperating processes. fdbcli has a debugging feature for checking a certain range, i.e., checkall.
fdbcli also has debugging features for this, i.e., getall and getlocation.
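
For the repair itself (as opposed to the consistency check), one way to watch progress is to poll the output of `status json` from fdbcli and look at the reported data state. The following is only a minimal sketch: the helper names are hypothetical, the field names (`cluster.data.state.healthy`, `name`, `min_replicas_remaining`) are what I understand the machine-readable status to contain and should be verified against your FoundationDB version, and the expected replica count of 3 is an assumption for a three_data_hall cluster.

```python
import json
import subprocess


def replication_summary(status):
    """Extract the data-state fields from a parsed `status json` document.

    Assumption: the fields live under cluster.data.state; missing fields
    are reported as unknown/None rather than raising.
    """
    state = status.get("cluster", {}).get("data", {}).get("state", {})
    return {
        "healthy": bool(state.get("healthy", False)),
        "name": state.get("name", "unknown"),
        "min_replicas_remaining": state.get("min_replicas_remaining"),
    }


def repair_finished(status, expected_replicas=3):
    """Return True when the cluster reports healthy data and at least
    `expected_replicas` copies of the least-replicated data.

    `expected_replicas=3` is an assumption for three_data_hall mode.
    """
    summary = replication_summary(status)
    remaining = summary["min_replicas_remaining"]
    return (
        summary["healthy"]
        and remaining is not None
        and remaining >= expected_replicas
    )


def fetch_status(cluster_file=None):
    """Run fdbcli and parse its `status json` output (needs a live cluster)."""
    cmd = ["fdbcli"]
    if cluster_file:
        cmd += ["-C", cluster_file]
    cmd += ["--exec", "status json"]
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return json.loads(out)
```

Against a live cluster this could be used as `repair_finished(fetch_status())` in a polling loop. It only reflects what the data distributor reports about replication; it does not replace the consistency check, which is what detects corrupted data.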