splitStorageMetrics gets stuck waiting for the number of locations to drop during consistency check
xis19 opened this issue · comments
When running the consistency check, we check the split storage metrics. In #10066, we call splitStorageMetrics to get the number of splits. During the call, we check the number of key range locations by calling getKeyRangeLocations.
When the number of locations (NoL) is larger than a given value, CLIENT_KNOBS->STORAGE_METRICS_SHARD_LIMIT, we wait and retry the NoL check until it is smaller than the limit. By default the limit is 100, but when buggify is on, it has a chance of being set to 3.
Now, during the consistency check the database does not change, so once NoL reaches 3 or more, the whole process can get stuck in splitStorageMetrics, waiting for the key locations to change.
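The hang can be illustrated with a minimal, standalone C++ sketch of the retry loop. This is not the actual FoundationDB code: `LocationCounter` and `waitForLocationsBelowLimit` are hypothetical names, the `>=` comparison against the limit is an assumption based on the description above, and `maxAttempts` exists only so the sketch terminates (the loop described in this issue has no such bound).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>

// Hypothetical stand-in for the location lookup: returns how many key range
// locations currently cover the requested range.
using LocationCounter = std::function<std::size_t()>;

// Retries until the location count drops below shardLimit.
// Returns the number of attempts used, or std::nullopt if the count never
// dropped within maxAttempts. The cap is an addition for this sketch only;
// without it the loop would spin forever on a static database.
std::optional<int> waitForLocationsBelowLimit(const LocationCounter& countLocations,
                                              std::size_t shardLimit,
                                              int maxAttempts) {
    assert(shardLimit > 0 && maxAttempts > 0);
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (countLocations() < shardLimit) {
            return attempt; // few enough locations, proceed with the split
        }
        // The real code would wait here before retrying, hoping that the
        // number of locations has changed in the meantime.
    }
    return std::nullopt; // the count never changed: this models the hang
}
```

With a counter that always returns 3 and a limit of 3 (the buggified value), the function exhausts every attempt and returns `std::nullopt`. With the default limit of 100 it returns on the first attempt.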
A quick fix might be to increase the NoL limit, but we may need more insight into this before applying a brute-force fix.
This issue was found in #10140, at git hash 0a3ffbf86815cf9e78360a6e6f0d44687b791418, with these tests:
bin/fdbserver -r simulation --crash -s 1404029917 -b on -f /root/src/tests/fast/MutationLogReaderCorrectness.toml
bin/fdbserver -r simulation --crash -s 2151331495 -b on -f /root/src/tests/fast/MutationLogReaderCorrectness.toml