etcd-io / bbolt

An embedded key/value database for Go.

Home Page: https://go.etcd.io/bbolt

DB cannot be opened after node hard reset

fyfyrchik opened this issue

Hello! I am not sure whether this is a bug in bolt, but I think you might find this interesting:

  1. Create ext4 with the fast_commit feature enabled (we use a 5.10 kernel).
  2. Create the DB with default settings (i.e. no unsafe settings; my only custom parameters are batch size and batch delay, which seem unrelated).
  3. Perform a server hard reset under write load.
  4. The DB cannot be opened; bbolt check reports:
panic: assertion failed: Page expected to be: 23092, but self identifies as 0

goroutine 6 [running]:
go.etcd.io/bbolt._assert(...)
        /repo/bbolt/db.go:1359
go.etcd.io/bbolt.(*page).fastCheck(0x7f0b95834000, 0x5a34)
        /repo/bbolt/page.go:57 +0x1d9
go.etcd.io/bbolt.(*Tx).page(0x7f0b909be000?, 0x4f9cc0?)
        /repo/bbolt/tx.go:534 +0x79
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x7f0b8fe11000?, {0xc000024140?, 0x3, 0xa}, 0xc00008db68)
        /repo/bbolt/tx.go:546 +0x5d
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x7f0b8fe19000?, {0xc000024140?, 0x2, 0xa}, 0xc00008db68)
        /repo/bbolt/tx.go:555 +0xc8
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x0?, {0xc000024140?, 0x1, 0xa}, 0xc00008db68)
        /repo/bbolt/tx.go:555 +0xc8
go.etcd.io/bbolt.(*Tx).forEachPage(...)
        /repo/bbolt/tx.go:542
go.etcd.io/bbolt.(*Tx).checkBucket(0xc00013e000, 0xc00002a280, 0xc00008deb0, 0xc00008dee0, {0x54b920?, 0x62ec60}, 0xc00007c0c0)
        /repo/bbolt/tx_check.go:83 +0x111
go.etcd.io/bbolt.(*Tx).checkBucket.func2({0x7f0b8fe3e0a6?, 0xc000104d28?, 0xc00013e000?})
        /repo/bbolt/tx_check.go:110 +0x90
go.etcd.io/bbolt.(*Bucket).ForEachBucket(0x0?, 0xc00008dd70)
        /repo/bbolt/bucket.go:403 +0x96
go.etcd.io/bbolt.(*Tx).checkBucket(0xc00013e000, 0xc00013e018, 0xc000104eb0, 0xc000104ee0, {0x54b920?, 0x62ec60}, 0xc00007c0c0)
        /repo/bbolt/tx_check.go:108 +0x252
go.etcd.io/bbolt.(*Tx).check(0xc00013e000, {0x54b920, 0x62ec60}, 0x0?)
        /repo/bbolt/tx_check.go:61 +0x365
created by go.etcd.io/bbolt.(*Tx).CheckWithOptions in goroutine 1
        /repo/bbolt/tx_check.go:31 +0x118

The last 64 pages of the file seem to be filled with zeroes.
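
For reference, a minimal sketch of how such trailing zero pages can be detected (assumes the default 4 KiB page size; this is illustrative, not the exact tool used here):

package main

import (
	"bytes"
	"fmt"
	"os"
)

// Scan a bbolt file from the end and report how many trailing pages are
// entirely zero-filled. The 4096-byte page size is an assumption (bbolt
// defaults to the OS page size).
func main() {
	const pageSize = 4096

	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		panic(err)
	}
	numPages := fi.Size() / pageSize

	zero := make([]byte, pageSize)
	buf := make([]byte, pageSize)
	trailing := int64(0)
	for id := numPages - 1; id >= 0; id-- {
		if _, err := f.ReadAt(buf, id*pageSize); err != nil {
			panic(err)
		}
		if !bytes.Equal(buf, zero) {
			break
		}
		trailing++
	}
	fmt.Printf("%d of %d pages at the end of the file are all zeroes\n", trailing, numPages)
}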

When ext4 fast_commit is disabled, our tests pass and the DB can be opened.
I have reproduced this on both bbolt 1.3.6 and 1.3.7.

Here is the output of bbolt pages: https://gist.github.com/fyfyrchik/4aafec23d9dfc487fb4a4cd7f5560730

Meta pages

$ ./bbolt page ./1 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=26>
Freelist:   <pgid=97>
HWM:        <pgid=23157>
Txn ID:     1198
Checksum:   6397aaef7230fab5

$ ./bbolt page ./1 1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=62>
Freelist:   <pgid=102>
HWM:        <pgid=23157>
Txn ID:     1199
Checksum:   8ec692c6ba15e06b

Here is the output of bbolt pages: https://gist.github.com/fyfyrchik/4aafec23d9dfc487fb4a4cd7f5560730

There are only 2754 lines, but there are 23157 pages (each page has one line). Did you manually remove some lines from the output of bbolt pages path-2-db?

Just to double check: you can easily reproduce this issue with fast_commit enabled, and have never reproduced it with fast_commit disabled?

What's the OS and arch?

Could you share your test case? Does your test case write any huge objects (e.g. an object > 10M)?

Did you manually remove some lines from the output of bbolt pages path-2-db?

No, that's the full output.

Just to double check: you can easily reproduce this issue with fast_commit enabled, and have never reproduced it with fast_commit disabled?

Yes

What's the OS and arch?

Linux (Debian), kernel 5.10.

Could you share your test case? Does your test case write any huge objects (e.g. an object > 10M)?

All objects were less than or equal to 128 KiB.
We have 3 buckets (for 32, 64 and 128 KiB objects respectively).
On a single disk there are about 100 identical DBs being written to; we start the load, do multiple hard reboots during the night, and check the result in the morning.
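
Roughly, the write load looks like the sketch below (illustrative only: bucket names, key scheme and value generation are made up, and the real loader runs many such writers concurrently against ~100 DB files):

package main

import (
	"crypto/rand"
	"encoding/binary"
	"time"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Default (safe) options; only the batch parameters are customized.
	db, err := bolt.Open("objects.db", 0600, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	db.MaxBatchSize = 100                    // custom batch size
	db.MaxBatchDelay = 10 * time.Millisecond // custom batch delay

	// One bucket per object size class (32, 64, 128 KiB).
	sizes := map[string]int{"size32": 32 << 10, "size64": 64 << 10, "size128": 128 << 10}

	// Write until the node is hard reset.
	for i := uint64(0); ; i++ {
		for name, size := range sizes {
			key := make([]byte, 8)
			binary.BigEndian.PutUint64(key, i)
			val := make([]byte, size)
			rand.Read(val)

			if err := db.Batch(func(tx *bolt.Tx) error {
				b, err := tx.CreateBucketIfNotExists([]byte(name))
				if err != nil {
					return err
				}
				return b.Put(key, val)
			}); err != nil {
				panic(err)
			}
		}
	}
}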

Thanks @fyfyrchik for the feedback.

Could you share the db file?

  1. Perform a server hard reset under write load.

Just to double check, do you mean you intentionally and manually power off/on the Linux server on which your test was running?

Yes, I can; the file is about 100 MB. Where should I send it?

Just to double check, do you mean you intentionally and manually power off/on the Linux server on which your test was running?

Yes, we power off the storage node; the loader is on a different one.

Yes, I can; the file is about 100 MB. Where should I send it?

Is it possible for you to put the file somewhere I can download it? If yes, please send the link to wachao@vmware.com. If not, could you try to transfer the file to me via Slack? Please reach out in the K8s workspace (Benjamin Wang).

Yes, we power off the storage node; the loader is on a different one.

Thanks for the info. Do you mount the storage node on the loader, with the loader accessing the storage node over NFS?

Thanks @fyfyrchik for sharing the db file offline. The db file is corrupted. Unfortunately the existing check command can't provide any useful info.

I will enhance the existing check command to:

  • not panic
  • support checking starting from a given pageId.
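
For reference, the same consistency check can also be driven programmatically through Tx.Check (a minimal sketch; on a corrupted file like this one it currently hits the same assertion/panic, which is what the first bullet above is about):

package main

import (
	"fmt"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open the damaged file read-only so the check cannot modify it.
	db, err := bolt.Open("./1", 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	if err := db.View(func(tx *bolt.Tx) error {
		// Tx.Check streams any inconsistencies it finds over a channel.
		for cerr := range tx.Check() {
			fmt.Println("check:", cerr)
		}
		return nil
	}); err != nil {
		panic(err)
	}
}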

Hi @fyfyrchik ,

For the "Perform a server hard reset under write load" step, did you mean something like echo b > /proc/sysrq-trigger to force a reboot?

Hi @fuweid,
Yes (or something like echo c?). In our case, the reboot was done via the BMC console.

@fyfyrchik thanks! I will try to reproduce this with dm-flakey.

Updated.

@fyfyrchik since my local env is using a v6.4 kernel, I built a v5.10 kernel and ran it with QEMU.

#!/bin/bash

set -exuo pipefail

readonly db_file=/var/lib/etcd/xxx.db
mkdir -p /var/lib/etcd

# If a db file survived the previous reboot, inspect and verify it first.
if [[ -f "${db_file}" ]]; then
  ls -alh "${db_file}"
  bbolt check "${db_file}"
fi

# Generate write load in the background, then force an immediate reboot
# at a random point within 1-10 seconds.
bbolt bench -work -path "${db_file}" -count=1000000000 -batch-size=100 -value-size=4096 2> /dev/null &
sleep $(( ( RANDOM % 10 ) + 1 ))
echo b > /proc/sysrq-trigger

I can reproduce this issue with the main branch.

panic: assertion failed: Page expected to be: 56763, but self identifies as 0

goroutine 6 [running]:
go.etcd.io/bbolt/internal/common.Assert(...)
        /home/fuwei/workspace/bbolt/internal/common/verify.go:65
go.etcd.io/bbolt/internal/common.(*Page).FastCheck(0x7f59cd5bb000, 0xddbb)
        /home/fuwei/workspace/bbolt/internal/common/page.go:83 +0x1d9
go.etcd.io/bbolt.(*Tx).page(0x7f59cd51e000?, 0x5dc120?)
        /home/fuwei/workspace/bbolt/tx.go:545 +0x79
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x7f59c61b6000?, {0xc0000200f0?, 0x4, 0xa}, 0xc000131b60)
        /home/fuwei/workspace/bbolt/tx.go:557 +0x5d
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x7f59c6cfe000?, {0xc0000200f0?, 0x3, 0xa}, 0xc000131b60)
        /home/fuwei/workspace/bbolt/tx.go:566 +0xc5
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0x7f59c6d95000?, {0xc0000200f0?, 0x2, 0xa}, 0xc000131b60)
        /home/fuwei/workspace/bbolt/tx.go:566 +0xc5
go.etcd.io/bbolt.(*Tx).forEachPageInternal(0xc000104b38?, {0xc0000200f0?, 0x1, 0xa}, 0xc000131b60)
        /home/fuwei/workspace/bbolt/tx.go:566 +0xc5
go.etcd.io/bbolt.(*Tx).forEachPage(...)
        /home/fuwei/workspace/bbolt/tx.go:553
go.etcd.io/bbolt.(*Tx).checkBucket(0xc000146000, 0xc000024140, 0xc000131eb0, 0xc000131ee0, {0x67bd40?, 0x80b420}, 0xc00007e0c0)
        /home/fuwei/workspace/bbolt/tx_check.go:82 +0x111
go.etcd.io/bbolt.(*Tx).checkBucket.func2({0x7f59c6d96020?, 0xc000146000?, 0xc000104d20?})
        /home/fuwei/workspace/bbolt/tx_check.go:109 +0x90
go.etcd.io/bbolt.(*Bucket).ForEachBucket(0x0?, 0xc000131d68)
        /home/fuwei/workspace/bbolt/bucket.go:451 +0x96
go.etcd.io/bbolt.(*Tx).checkBucket(0xc000146000, 0xc000146018, 0xc000104eb0, 0xc000104ee0, {0x67bd40?, 0x80b420}, 0xc00007e0c0)
        /home/fuwei/workspace/bbolt/tx_check.go:107 +0x252
go.etcd.io/bbolt.(*Tx).check(0xc000146000, {0x67bd40, 0x80b420}, 0x0?)
        /home/fuwei/workspace/bbolt/tx_check.go:60 +0x365
created by go.etcd.io/bbolt.(*Tx).Check in goroutine 1
        /home/fuwei/workspace/bbolt/tx_check.go:30 +0x118

56886    unknown<00> 0
56887    unknown<00> 0
56888    unknown<00> 0
56889    unknown<00> 0
56890    unknown<00> 0
56891    unknown<00> 0
56892    unknown<00> 0
56893    unknown<00> 0
56894    unknown<00> 0
56895    unknown<00> 0
56896    unknown<00> 0
56897    unknown<00> 0
56898    unknown<00> 0
56899    unknown<00> 0
56900    unknown<00> 0
56901    unknown<00> 0
56902    unknown<00> 0
56903    unknown<00> 0
56904    unknown<00> 0
56905    unknown<00> 0
56906    unknown<00> 0
56907    unknown<00> 0
56908    unknown<00> 0
56909    unknown<00> 0
56910    unknown<00> 0
56911    unknown<00> 0
56912    unknown<00> 0
56913    free
56914    free
56915    leaf       2      2
56918    leaf       2      2
56921    leaf       2      2
56924    leaf       2      2
56927    leaf       2      2
56930    leaf       2      2
56933    leaf       2      2
56936    leaf       2      2
56939    leaf       2      2
56942    leaf       2      2
56945    leaf       2      2
56948    leaf       2      2
56951    leaf       2      2
56954    leaf       2      2
56957    leaf       2      2

I am not sure it's related to fast_commit because I didn't find a relevant patch for this.
Trying to test it without fast_commit.

cc @ahrtr

Found a relevant patch about Fast Commit: https://lore.kernel.org/linux-ext4/20211223032337.5198-3-yinxin.x@bytedance.com/. I confirmed with that patch's author that fast commit in v5.10 has some corner cases which can lose data even if fsync/fdatasync returns successfully...

@fyfyrchik I would suggest upgrading the kernel or disabling the fast_commit option 😂

NOTE:

Another related patch is here: https://lore.kernel.org/lkml/202201091544.W5HHEXAp-lkp@intel.com/T/, which is covered by https://kernel.googlesource.com/pub/scm/fs/xfs/xfstests-dev/+/refs/heads/master/tests/generic/455

I can't reproduce this issue on a newer kernel with the fast_commit option, or on v5.10 without the fast_commit option.

I talked to the patch's author, and I believe this is a kernel bug rather than a bbolt bug.

@fyfyrchik Thanks for reporting this!

Found a relevant patch about Fast Commit: https://lore.kernel.org/linux-ext4/20211223032337.5198-3-yinxin.x@bytedance.com/. I confirmed with that patch's author that fast commit in v5.10 has some corner cases which can lose data even if fsync/fdatasync returns successfully...

Great finding.

Just as we discussed earlier today, we received around 18 data corruption cases in the past 2 years. Most likely fast commit wasn't enabled at all in most cases.

  • Enable the nightly workflow.
  • Let's keep the robustness testing running for a long time (e.g. > 4 weeks).
  • Can we also keep the test "bbolt xxxx; echo b > /proc/sysrq-trigger" running for a long time (e.g. > 4 weeks)?

Also, let's add a known issue to the README clarifying:

  • on which kernel versions fast_commit should not be enabled
  • which kernel version contains the fix

Can we also keep the test "bbolt xxxx; echo b > /proc/sysrq-trigger" running for a long time (e.g. > 4 weeks)?

Will do, and I will think about how to submit it as code in the repo. Assign this issue to me.

/assign

@fuweid thanks for your research!
In case someone is wondering, I have tracked down the kernel version in which this patch was added:
The second link is present in 5.17 (the handle parameter): https://elixir.bootlin.com/linux/v5.17/source/fs/ext4/fast_commit.c#L307
It is also in 5.15.144, 5.16... with the latest fixes.
So to fix the problem, one might want to try a kernel >= 5.17.

Thanks @fyfyrchik! Yeah, fast commit is a new feature, and when we enable it we need to run failpoint test cases against it. It's not just bbolt, but also other applications that care about data persistence.

For reference, from the application's perspective (bbolt in this case), the root cause of the issue is that the fdatasync system call returns success (err == nil) while the kernel hasn't actually synced the data yet. This can lead to a situation where the meta page is persisted successfully but some data pages are not, so eventually some pageIds may point to corrupted data pages.
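
To make that ordering concrete, here is a conceptual sketch (not bbolt's actual commit code) of the write/sync sequence a commit depends on, and where a "lying" fdatasync breaks it:

package sketch

import (
	"os"

	"golang.org/x/sys/unix"
)

// page is a simplified stand-in for an on-disk page.
type page struct {
	offset int64
	buf    []byte
}

// commit illustrates the ordering a transaction commit relies on.
func commit(f *os.File, dataPages []page, meta page) error {
	// 1. Write the new/updated data and freelist pages.
	for _, p := range dataPages {
		if _, err := f.WriteAt(p.buf, p.offset); err != nil {
			return err
		}
	}
	// 2. Make those pages durable *before* publishing them. If this returns
	//    nil but the data never actually reaches disk (the fast_commit corner
	//    case), the invariant below is silently broken.
	if err := unix.Fdatasync(int(f.Fd())); err != nil {
		return err
	}
	// 3. Only now write the meta page whose root/freelist pgids point at the
	//    pages written in step 1.
	if _, err := f.WriteAt(meta.buf, meta.offset); err != nil {
		return err
	}
	// 4. Make the meta page durable. On open, bbolt picks the valid meta page
	//    with the highest txid; after a crash its pgids can then reference
	//    pages that were never persisted, i.e. the zero-filled pages seen above.
	return unix.Fdatasync(int(f.Fd()))
}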