openzfs / zfs

OpenZFS on Linux and FreeBSD

Home Page: https://openzfs.github.io/openzfs-docs

ZFS raw send/recv, when reversed, results in corrupted volume

rnoons opened this issue

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Debian |
| Distribution Version | 11.10 |
| Kernel Version | 5.10 |
| Architecture | x86_64 |
| OpenZFS Version | 2.0.3 |

Describe the problem you're observing

While using encryption, sending a raw zfs send stream back to the originating host results in a corrupted pool:

errors: Permanent errors have been detected in the following files:
pool/location1:<0x0>

zpool clear/scrub do not resolve the issue.
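For completeness, the clear/scrub attempt (standard commands; the report just notes they were tried without effect):

```sh
zpool clear pool        # clear the logged errors
zpool scrub pool        # full scrub of the pool
zpool status -v pool    # still reports pool/location1:<0x0> afterwards
```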

Describe how to reproduce the problem

Host1: pool (created with encryption); Host2: pool (created with encryption)
Host1: pool/location1 with its key inherited from pool; Host2: pool/location1 is its own encryption root (key inherited from itself)
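A minimal sketch of that setup, with some assumptions (device names are placeholders, and pool/location1 on Host2 is assumed to be created by the raw receive itself, which is what leaves it as its own encryption root):

```sh
# On both hosts: an encrypted pool with a passphrase wrapping key.
# /dev/sdX is a placeholder device.
zpool create -O encryption=on -O keyformat=passphrase -O keylocation=prompt pool /dev/sdX

# On Host1 only: the child dataset, inheriting encryption and key from pool.
zfs create pool/location1
```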

Host1: execute zfs send -w (raw) pool/location1@snap1 | (on Host2) zfs recv pool/location1
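Spelled out with an ssh transport (the transport and host names are assumptions; only the send/recv commands themselves come from the report):

```sh
# On Host1: take the first snapshot, then send it raw (-w) so the encrypted
# blocks and key metadata are replicated as-is. The receive creates
# pool/location1 on Host2.
zfs snapshot pool/location1@snap1
zfs send -w pool/location1@snap1 | ssh host2 zfs recv pool/location1
```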

Reboot both hosts, load keys, and mount; everything works as expected.

Now reverse this in a failback scenario: for instance, you're now writing data on Host2 and want to fail back to Host1 to maintain redundancy. (Both systems have the same snapshot history.)

Host2: zfs snapshot pool/location1@snap2
Host2: zfs send -w (raw) -i pool/location1@snap1 pool/location1@snap2 | (on Host1) zfs recv pool/location1
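The failback step, spelled out the same way (again a sketch; it assumes pool/location1 on Host1 has not been modified since @snap1, otherwise zfs recv -F would be needed):

```sh
# On Host2: incremental raw send of snap1..snap2 back to Host1.
zfs snapshot pool/location1@snap2
zfs send -w -i pool/location1@snap1 pool/location1@snap2 | ssh host1 zfs recv pool/location1
```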

Again everything replicates just fine. Host2 shows pool/location1 as its own encryption root, while Host1 shows the key inherited from pool.
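The key-inheritance difference can be confirmed with standard zfs properties (a sketch; dataset names as in the repro above):

```sh
# 'encryptionroot' shows which dataset a filesystem actually inherits its
# wrapping key from; run on each host and compare.
zfs get encryptionroot,keyformat,keylocation,keystatus pool pool/location1
```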

Now reboot Host1: zfs load-key pool; zfs load-key pool/location1; zfs mount -a
Reboot Host2:
zpool status

errors: Permanent errors have been detected in the following files:
pool/location1:<0x0>

zfs load-key pool works fine, as does zfs load-key pool/location1.

pool/location1 is unmountable.

I've also attempted zfs change-key on Host1 before this failback scenario, so that pool/location1 would no longer depend on the pool's source key (i.e. become its own encryption root); however, this did not help.
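The change-key attempt presumably looked something like this (a sketch; the exact options used aren't stated above):

```sh
# On Host1: make pool/location1 its own encryption root instead of
# inheriting the wrapping key from pool.
zfs change-key -o keyformat=passphrase -o keylocation=prompt pool/location1
```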

In all cases the encryption key is a passphrase, the same one is used for both pool and pool/location1, and I am able to load the key successfully.

Include any warning/errors/backtraces from the system logs

Try again with a version newer than 2021.

This is interesting. I had previously skimmed through the changelog and didn't see any bugs related to this, with the possible exception of #12000. Honestly, I wasn't certain that was even related, since I wasn't using -d on the receive side. On my test servers, while attempting to reproduce and resolve this, I did upgrade ZFS to 2.1.11 (the latest available backport) and it did not fix the issue; however, I overlooked the fact that if I try to reproduce it with a fresh zpool, repeating all my steps above, the issue in fact no longer exists.

Does anyone have any idea what the source of this bug is, and whether there is a way to recover post-upgrade without a complete fresh send? It obviously has to do with the key, but I find the whole thing a little mind-boggling. On Host1 the key is contained in the pool and replicates to Host2 correctly, and that survives multiple reboots and replications. It's only when syncing the child pool/location1 from Host2 back to Host1 that it breaks.

Actually, I was able to resolve this by upgrading ZFS to 2.1.11+ and then sending one final incremental snapshot to the corrupted volume post-upgrade. Then you can either export/import or reboot, and it does in fact recover. I have not been able to find a way to fix this without a replica, but this is at least sufficient for my needs. Hope this is helpful to someone else.
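For anyone in the same state, the recovery amounts to roughly the following (a sketch of the steps described above; host names, the ssh transport, and the @snap3 name are placeholders):

```sh
# After upgrading both hosts to ZFS >= 2.1.11:

# On Host2: one more incremental raw send into the corrupted dataset on Host1.
zfs snapshot pool/location1@snap3
zfs send -w -i pool/location1@snap2 pool/location1@snap3 | ssh host1 zfs recv pool/location1

# On Host1: export/import the pool (or reboot), reload keys, and mount.
zpool export pool
zpool import pool
zfs load-key -a
zfs mount -a
zpool status -v pool   # the permanent error cleared here after these steps
```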