openzfs / zfs

OpenZFS on Linux and FreeBSD

Home Page: https://openzfs.github.io/openzfs-docs


Disk corruption without zfs_vdev_disk_classic=1 for a single virtual machine.

angstymeat opened this issue · comments

System information

Type Version/Name
Distribution Name Fedora Core
Distribution Version 23
Kernel Version 4.8.13-100.fc23.x86_64
Architecture x86_64
OpenZFS Version zfs-2.2.99-534_gc98295e

Describe the problem you're observing

I'm migrating our virtual machines from VMWare ESXi to Proxmox 8.2.2. It has gone smoothly except for a single virtual machine, one of our older systems running proprietary software (which is why it is still running Fedora Core 23). The VM has four disks connected over a SAS JBOD that are passed through directly to it, the same configuration they had under VMWare.

This VM immediately began exhibiting disk corruption, reporting numerous read, write, and checksum errors. I immediately stopped it and booted up the VMWare version using the same disks and scrubbed the pool. No errors were reported.

I have at least a dozen other virtual machines that I have migrated to Proxmox, also using ZFS (most of them running the latest version), and none of them exhibit this issue. The VM configuration (hardware type, CPU type, etc.) is the same between all of them (memory size and CPU count vary).

Some of them are Fedora Core 18. Some are CentOS 7, some are CentOS 8. None of them have this issue.

The VM was originally FC22 when I migrated, and, thinking it was a kernel issue, I updated it to FC23 (the kernel went from 4.4 to 4.8); however, the issue persisted.

While searching I came across #15533, which exhibited the same symptoms, but I'm not running on top of LUKS or anything. When I applied zfs_vdev_disk_classic=1, the errors went away.

Again, none of my other VMs need this option set. Other than the kernel versions, I can't figure out what is different or why this is happening. We either use older kernels like 3.10 under CentOS 7, or newer ones like 4.11 and above (FC24, CentOS 8, etc.).
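For anyone hitting the same symptoms, applying the workaround looks roughly like this (a sketch; the parameter exists in OpenZFS 2.2.x, but whether a runtime change takes effect before the next module load may vary, so the persistent modprobe.d route is the safer bet):

```shell
# Check the current setting of the vdev disk submission mode
cat /sys/module/zfs/parameters/zfs_vdev_disk_classic

# Make it persistent: revert to the classic BIO submission code at module load
echo "options zfs zfs_vdev_disk_classic=1" > /etc/modprobe.d/zfs.conf

# The parameter selects the submission path when the module loads, so a
# reboot (or export pools + reload the zfs module) is needed to apply it.
```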

Describe how to reproduce the problem

Currently, I can get this to occur regularly using this particular zpool on this particular machine under Proxmox 8.2.2, but not under VMWare. I boot the machine and start running our software, which performs many small reads and writes in multiple threads (it is collecting seismic data from multiple sources) to memory-mapped files.
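A rough approximation of that workload can be generated with fio's mmap engine (a hypothetical reproducer, not the actual software; job parameters are guesses at "many small writes in multiple threads"):

```shell
# Hammer memory-mapped files with small concurrent writes, similar in shape
# to the multi-threaded seismic-data collector described above.
fio --name=mmap-repro \
    --directory=/storage/fio-test \
    --ioengine=mmap \
    --rw=randwrite \
    --bs=4k \
    --numjobs=8 \
    --size=1G \
    --time_based --runtime=600
```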

Include any warning/errors/backtraces from the system logs

Under FC22 I would see many errors about losing connection to the disks in the system logs. However, I did not hold onto those errors while I was debugging. These errors would not appear on the Proxmox host, only in the VM.

Under FC23 I get the following:

Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918200]  snd_hda_codec irqbypass iTCO_vendor_support crct10dif_pclmul crc32_pclmul snd_hda_core snd_hwdep crc32c_intel snd_seq snd_seq_device ghash_clmulni_intel snd_pcm intel_rapl_perf i2c_i801 i2c_smbus virtio_balloon joydev snd_timer snd lpc_ich soundcore shpchp acpi_cpufreq tpm_tis tpm_tis_core tpm qemu_fw_cfg nfsd auth_rpcgss nfs_acl lockd grace sunrpc virtio_net virtio_console virtio_scsi bochs_drm drm_kms_helper ttm drm serio_raw virtio_pci virtio_ring virtio lz4 lz4_compress
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918216] CPU: 0 PID: 958 Comm: z_wr_int Tainted: P        W  OE   4.8.13-100.fc23.x86_64 #1
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918217] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918217]  0000000000000286 000000007e16f741 ffff9c0259177a10 ffffffffa73e496e
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918219]  0000000000000000 0000000000000000 ffff9c0259177a50 ffffffffa70a0ecb
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918220]  000002d600000000 ffff9c025903e340 0000000000000000 0000000000001000
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918221] Call Trace:
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918225]  [<ffffffffa73e496e>] dump_stack+0x63/0x85
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918226]  [<ffffffffa70a0ecb>] __warn+0xcb/0xf0
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918227]  [<ffffffffa70a0ffd>] warn_slowpath_null+0x1d/0x20
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918273]  [<ffffffffc099084a>] vbio_fill_cb+0x15a/0x190 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918319]  [<ffffffffc09906f0>] ? vbio_completion+0xa0/0xa0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918358]  [<ffffffffc085b17e>] abd_iterate_page_func+0xce/0x190 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918405]  [<ffffffffc09910c5>] vdev_disk_io_rw+0x1d5/0x2e0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918453]  [<ffffffffc098fb81>] vdev_disk_io_start+0x161/0x490 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918507]  [<ffffffffc0980ac2>] zio_vdev_io_start+0x142/0x310 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918557]  [<ffffffffc097f4ff>] zio_execute+0x8f/0xf0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918606]  [<ffffffffc0932f03>] vdev_queue_io_done+0x123/0x220 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918654]  [<ffffffffc097daca>] zio_vdev_io_done+0x9a/0x210 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918701]  [<ffffffffc097f4ff>] zio_execute+0x8f/0xf0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918706]  [<ffffffffc06cf4fd>] taskq_thread+0x29d/0x4d0 [spl]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918708]  [<ffffffffa70cbb50>] ? wake_up_q+0x70/0x70
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918754]  [<ffffffffc097f470>] ? zio_reexecute+0x4a0/0x4a0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918759]  [<ffffffffc06cf260>] ? taskq_thread_spawn+0x60/0x60 [spl]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918760]  [<ffffffffa70c0bf8>] kthread+0xd8/0xf0
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918762]  [<ffffffffa77ffdff>] ret_from_fork+0x1f/0x40
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918763]  [<ffffffffa70c0b20>] ? kthread_worker_fn+0x170/0x170
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918774] ---[ end trace 368f8d93b1defe8c ]---
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918800] ------------[ cut here ]------------

Well, all that did was push the errors off for a couple of hours instead of them happening immediately.

Do you happen to have the first lines of the kernel panic? Looks like they may have been cut off.

It looks like I cut it off. Here's a full one I think:

Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.917870] ------------[ cut here ]------------
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.917920] WARNING: CPU: 0 PID: 958 at /root/src/zfs/module/os/linux/zfs/vdev_disk.c:726 vbio_fill_cb+0x15a/0x190 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.917968]  [<ffffffffc097f4ff>] zio_execute+0x8f/0xf0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918017]  [<ffffffffc0932f03>] vdev_queue_io_done+0x123/0x220 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918065]  [<ffffffffc097daca>] zio_vdev_io_done+0x9a/0x210 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918066] Modules linked in: edac_core
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918115]  [<ffffffffc097f4ff>] zio_execute+0x8f/0xf0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918118]  kvm_intel [<ffffffffc06cf4fd>] taskq_thread+0x29d/0x4d0 [spl]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918126]  [<ffffffffa70cbb50>] ? wake_up_q+0x70/0x70
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918128]  zfs(POE)
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918176]  [<ffffffffc097f470>] ? zio_reexecute+0x4a0/0x4a0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918177]  spl(OE) kvm
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918184]  [<ffffffffc06cf260>] ? taskq_thread_spawn+0x60/0x60 [spl]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918187]  [<ffffffffa70c0bf8>] kthread+0xd8/0xf0
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918189]  iTCO_wdt
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918192]  [<ffffffffa77ffdff>] ret_from_fork+0x1f/0x40
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918195]  snd_hda_intel
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918195]  [<ffffffffa70c0b20>] ? kthread_worker_fn+0x170/0x170
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918199] ---[ end trace 368f8d93b1defe8b ]---
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918200]  snd_hda_codec irqbypass iTCO_vendor_support crct10dif_pclmul crc32_pclmul snd_hda_core snd_hwdep crc32c_intel snd_seq snd_seq_device ghash_clmulni_intel snd_pcm intel_rapl_perf i2c_i801 i2c_smbus virtio_balloon joydev snd_timer snd lpc_ich soundcore shpchp acpi_cpufreq tpm_tis tpm_tis_core tpm qemu_fw_cfg nfsd auth_rpcgss nfs_acl lockd grace sunrpc virtio_net virtio_console virtio_scsi bochs_drm drm_kms_helper ttm drm serio_raw virtio_pci virtio_ring virtio lz4 lz4_compress
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918216] CPU: 0 PID: 958 Comm: z_wr_int Tainted: P        W  OE   4.8.13-100.fc23.x86_64 #1
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918217] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918217]  0000000000000286 000000007e16f741 ffff9c0259177a10 ffffffffa73e496e
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918219]  0000000000000000 0000000000000000 ffff9c0259177a50 ffffffffa70a0ecb
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918220]  000002d600000000 ffff9c025903e340 0000000000000000 0000000000001000
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918221] Call Trace:
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918225]  [<ffffffffa73e496e>] dump_stack+0x63/0x85
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918226]  [<ffffffffa70a0ecb>] __warn+0xcb/0xf0
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918227]  [<ffffffffa70a0ffd>] warn_slowpath_null+0x1d/0x20
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918273]  [<ffffffffc099084a>] vbio_fill_cb+0x15a/0x190 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918319]  [<ffffffffc09906f0>] ? vbio_completion+0xa0/0xa0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918358]  [<ffffffffc085b17e>] abd_iterate_page_func+0xce/0x190 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918405]  [<ffffffffc09910c5>] vdev_disk_io_rw+0x1d5/0x2e0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918453]  [<ffffffffc098fb81>] vdev_disk_io_start+0x161/0x490 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918507]  [<ffffffffc0980ac2>] zio_vdev_io_start+0x142/0x310 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918557]  [<ffffffffc097f4ff>] zio_execute+0x8f/0xf0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918606]  [<ffffffffc0932f03>] vdev_queue_io_done+0x123/0x220 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918654]  [<ffffffffc097daca>] zio_vdev_io_done+0x9a/0x210 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918701]  [<ffffffffc097f4ff>] zio_execute+0x8f/0xf0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918706]  [<ffffffffc06cf4fd>] taskq_thread+0x29d/0x4d0 [spl]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918708]  [<ffffffffa70cbb50>] ? wake_up_q+0x70/0x70
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918754]  [<ffffffffc097f470>] ? zio_reexecute+0x4a0/0x4a0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918759]  [<ffffffffc06cf260>] ? taskq_thread_spawn+0x60/0x60 [spl]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918760]  [<ffffffffa70c0bf8>] kthread+0xd8/0xf0
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918762]  [<ffffffffa77ffdff>] ret_from_fork+0x1f/0x40
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918763]  [<ffffffffa70c0b20>] ? kthread_worker_fn+0x170/0x170
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918774] ---[ end trace 368f8d93b1defe8c ]---
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918800] ------------[ cut here ]------------

I get it for multiple CPUs. Here's one of the others:

Jun 17 22:39:44 durga kernel: : [   14.881105] ------------[ cut here ]------------
Jun 17 22:39:44 durga kernel: : [   14.881153] WARNING: CPU: 2 PID: 2009 at /root/src/zfs/module/os/linux/zfs/vdev_disk.c:726 vbio_fill_cb+0x15a/0x190 [zfs]
Jun 17 22:39:44 durga kernel: : [   14.881154]  0000000000001000
Jun 17 22:39:44 durga kernel: : [   14.881159] Modules linked in: nf_conntrack_ipv4
Jun 17 22:39:44 durga kernel: : [   14.881163] Call Trace:
Jun 17 22:39:44 durga kernel: : [   14.881165]  nf_defrag_ipv4 xt_multiport ip6t_REJECT
Jun 17 22:39:44 durga kernel: : [   14.881174]  [<ffffffffa63e496e>] dump_stack+0x63/0x85
Jun 17 22:39:44 durga kernel: : [   14.881176]  nf_reject_ipv6
Jun 17 22:39:44 durga kernel: : [   14.881179]  [<ffffffffa60a0ecb>] __warn+0xcb/0xf0
Jun 17 22:39:44 durga kernel: : [   14.881182]  [<ffffffffa60a0ffd>] warn_slowpath_null+0x1d/0x20
Jun 17 22:39:44 durga kernel: : [   14.881184]  nf_conntrack_ipv6 nf_defrag_ipv6
Jun 17 22:39:44 durga kernel: : [   14.881233]  [<ffffffffc0f5a84a>] vbio_fill_cb+0x15a/0x190 [zfs]
Jun 17 22:39:44 durga kernel: : [   14.881281]  [<ffffffffc0f5a6f0>] ? vbio_completion+0xa0/0xa0 [zfs]
Jun 17 22:39:44 durga kernel: : [   14.881319]  [<ffffffffc0e2517e>] abd_iterate_page_func+0xce/0x190 [zfs]
Jun 17 22:39:44 durga kernel: : [   14.881368]  [<ffffffffc0f5b0c5>] vdev_disk_io_rw+0x1d5/0x2e0 [zfs]
Jun 17 22:39:44 durga kernel: : [   14.881406]  [<ffffffffc0e24205>] ? abd_alloc_struct+0x45/0x70 [zfs]
Jun 17 22:39:44 durga kernel: : [   14.881407]  xt_conntrack nf_conntrack ip6table_filter
Jun 17 22:39:44 durga kernel: : [   14.881459]  [<ffffffffc0f59b81>] vdev_disk_io_start+0x161/0x490 [zfs]
Jun 17 22:39:44 durga kernel: : [   14.881462]  ip6_tables
Jun 17 22:39:44 durga kernel: : [   14.881511]  [<ffffffffc0f4aac2>] zio_vdev_io_start+0x142/0x310 [zfs]
Jun 17 22:39:44 durga kernel: : [   14.881514]  zfs(POE) [<ffffffffc0f4d8a5>] zio_nowait+0xc5/0x170 [zfs]
Jun 17 22:39:44 durga kernel: : [   14.881611]  [<ffffffffc0efcfa3>] vdev_queue_io_done+0x1c3/0x220 [zfs]
Jun 17 22:39:44 durga kernel: : [   14.881613]  edac_core kvm_intel
Jun 17 22:39:44 durga kernel: : [   14.881663]  [<ffffffffc0f47aca>] zio_vdev_io_done+0x9a/0x210 [zfs]
Jun 17 22:39:44 durga kernel: : [   14.881713]  iTCO_wdt
Jun 17 22:39:44 durga kernel: : [   14.881713]  [<ffffffffc0f494ff>] zio_execute+0x8f/0xf0 [zfs]
Jun 17 22:39:44 durga kernel: : [   14.881720]  iTCO_vendor_support
Jun 17 22:39:44 durga kernel: : [   14.881720]  [<ffffffffc04ff4fd>] taskq_thread+0x29d/0x4d0 [spl]
Jun 17 22:39:44 durga kernel: : [   14.881723]  kvm
Jun 17 22:39:44 durga kernel: : [   14.881728]  [<ffffffffa60cbb50>] ? wake_up_q+0x70/0x70
Jun 17 22:39:44 durga kernel: : [   14.881729]  snd_hda_intel
Jun 17 22:39:44 durga kernel: : [   14.881777]  [<ffffffffc0f49470>] ? zio_reexecute+0x4a0/0x4a0 [zfs]
Jun 17 22:39:44 durga kernel: : [   14.881779]  snd_hda_codec
Jun 17 22:39:44 durga kernel: : [   14.881784]  [<ffffffffc04ff260>] ? taskq_thread_spawn+0x60/0x60 [spl]
Jun 17 22:39:44 durga kernel: : [   14.881786]  irqbypass crct10dif_pclmul crc32_pclmul
Jun 17 22:39:44 durga kernel: : [   14.881793]  [<ffffffffa60c0bf8>] kthread+0xd8/0xf0
Jun 17 22:39:44 durga kernel: : [   14.881797]  [<ffffffffa67ffdff>] ret_from_fork+0x1f/0x40
Jun 17 22:39:44 durga kernel: : [   14.881798]  crc32c_intel snd_hda_core ghash_clmulni_intel intel_rapl_perf snd_hwdep
Jun 17 22:39:44 durga kernel: : [   14.881806]  [<ffffffffa60c0b20>] ? kthread_worker_fn+0x170/0x170
Jun 17 22:39:44 durga kernel: : [   14.881808]  spl(OE)
Jun 17 22:39:44 durga kernel: : [   14.881810] ---[ end trace c4c71651078292e1 ]---
Jun 17 22:39:44 durga kernel: : [   14.881812]  snd_seq snd_seq_device
Jun 17 22:39:44 durga kernel: : [   14.881817] CPU: 0 PID: 2112 Comm: z_wr_int Tainted: P        W  OE   4.8.13-100.fc23.x86_64 #1
Jun 17 22:39:44 durga kernel: : [   14.881819]  snd_pcm
Jun 17 22:39:44 durga kernel: : [   14.881821] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
Jun 17 22:39:44 durga kernel: : [   14.881823]  snd_timer
Jun 17 22:39:44 durga kernel: : [   14.881825]  0000000000000286 joydev
Jun 17 22:39:44 durga kernel: : [   14.881830]  00000000f9daebd5 i2c_i801
Jun 17 22:39:44 durga kernel: : [   14.881834]  ffffa1892107ba00 snd
Jun 17 22:39:44 durga kernel: : [   14.881838]  ffffffffa63e496e<4>[   14.881840] ------------[ cut here ]------------

It looks like that one got messed up being written to the syslog.

There's a lot of them, but it looks like the same set of several errors repeating.

And then I start seeing these:

Jun 18 01:11:25 durga kernel: : [ 4632.968395] sd 9:0:0:3: [sdd] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 18 01:11:25 durga kernel: : [ 4632.968407] sd 9:0:0:3: [sdd] tag#1 Sense Key : Illegal Request [current]
Jun 18 01:11:25 durga kernel: : [ 4632.968409] sd 9:0:0:3: [sdd] tag#1 Add. Sense: Invalid field in cdb
Jun 18 01:11:25 durga kernel: : [ 4632.968412] sd 9:0:0:3: [sdd] tag#1 CDB: Write(16) 8a 00 00 00 00 00 37 45 ec 90 00 00 07 e8 00 00
Jun 18 01:11:25 durga kernel: : [ 4632.968414] blk_update_request: critical target error, dev sdd, sector 927329424
Jun 18 01:11:25 durga kernel: : [ 4632.968442] zio pool=storage vdev=/dev/disk/by-id/wwn-0x5000039818400991-part1 error=121 type=2 offset=474791616512 size=1036288 flags=1074267264
Jun 18 01:11:25 durga kernel: : [ 4632.972261] sd 10:0:0:4: [sde] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 18 01:11:25 durga kernel: : [ 4632.972266] sd 10:0:0:4: [sde] tag#1 Sense Key : Illegal Request [current]
Jun 18 01:11:25 durga kernel: : [ 4632.972267] sd 10:0:0:4: [sde] tag#1 Add. Sense: Invalid field in cdb
Jun 18 01:11:25 durga kernel: : [ 4632.972270] sd 10:0:0:4: [sde] tag#1 CDB: Write(16) 8a 00 00 00 00 00 37 45 ec 90 00 00 07 e8 00 00
Jun 18 01:11:25 durga kernel: : [ 4632.972271] blk_update_request: critical target error, dev sde, sector 927329424
Jun 18 01:11:25 durga kernel: : [ 4632.972319] zio pool=storage vdev=/dev/disk/by-id/wwn-0x5000039858417669-part1 error=121 type=2 offset=474791616512 size=1036288 flags=1074267264
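The failing Write(16) CDB above can be decoded by hand, and it lines up with the kernel's own numbers (a worked example; error=121 in the zio line is Linux EREMOTEIO, "Remote I/O error"):

```shell
# CDB: 8a 00 | 00 00 00 00 37 45 ec 90 | 00 00 07 e8 | 00 00
#      op+fl   8-byte starting LBA        4-byte len    grp+ctl
echo $((0x3745ec90))    # starting LBA -> 927329424, matches blk_update_request
echo $((0x7e8))         # transfer length -> 2024 blocks of 512 bytes
echo $((0x7e8 * 512))   # -> 1036288 bytes, matches the zio size= field
```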

The pool:

  pool: storage
 state: ONLINE
  scan: resilvered 2.64T in 11:05:04 with 0 errors on Tue Jun 18 13:08:04 2024
config:

	NAME                        STATE     READ WRITE CKSUM
	storage                     ONLINE       0     0     0
	  mirror-0                  ONLINE       0     0     0
	    wwn-0x50000398183bd8d1  ONLINE       0     0     0
	    scsi-3500003985840f9c9  ONLINE       0     0     0
	  mirror-1                  ONLINE       0     0     0
	    wwn-0x5000039858417669  ONLINE       0     0     0
	    wwn-0x5000039818400991  ONLINE       0     0     0

When this happens I get hundreds of read/write/checksum errors until the pool faults out. When I shut it down, load up the VMWare virtual machine with the same raw disks, and run zpool clear and zpool scrub, I end up with no errors.
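The cross-check from the other VM amounts to something like this (standard zpool commands; pool name from the status output above):

```shell
zpool import storage        # attach the same raw disks in the other VM
zpool clear storage         # reset the accumulated error counters
zpool scrub storage         # re-read and checksum every allocated block
zpool status -v storage     # expect 0 READ/WRITE/CKSUM errors if the data is intact
```

A clean scrub here is strong evidence the corruption is in the I/O submission path on the Proxmox side, not on the disks themselves.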

I've snapshotted the Proxmox VM and I'm going to try it out after updating to FC24, which is what another of our old systems is running, along with a similar zpool off the same JBOD that isn't having problems.

After updating it to FC24 with kernel 4.11.12-100.fc24.x86_64 I'm not having disk errors or corruption reported anymore. The system has been stable for 18 hours.

I started looking into this last night, but had nothing to report yet. Your last comment is interesting.

Can you clarify the specific RHEL/FC versions and kernel versions you tried, including the extended kernel versioning gunk that these kernels have, and whether or not they failed? At least, I'd like in/out version ranges.

I don't have a solid theory, just a few smells.

Red Hat backports heavily from newer kernels into their shipped ones, for example, the "3.10" kernel that ships with EL7 actually has some stuff from 5.x pulled back into it. This sort of thing sometimes requires us to go to some lengths inside OpenZFS to keep things running well, because version numbers stop being a good indicator of when a particular feature or behaviour changed.

For a couple of months, the new BIO submission code (zfs_vdev_disk_classic=0) caused memory corruption on 3.10-EL7. I eventually traced it to a change that I think happened in upstream Linux 4.5, so I added a version check to change the behaviour on kernels <4.5. That sorted out 3.10-EL7.

The crash output you're showing has similar shapes, which is making me wonder if either 4.5 is not the right place to draw the line, or if RHEL/FC kernels are modified in such a way that they didn't get the changed behaviour until 4.11-FC24, or something of that kind.

Or, maybe a different problem entirely! But yeah, that's why specific versions would help. Thanks :)


I keep forgetting that RH back-ports features and fixes. That can make diagnosis tricky.

The machine that's having problems started with Fedora 22 kernel 4.4.14-200.fc22.x86_64, then went to F23 with kernel 4.8.13-100.fc23.x86_64. When I updated to Fedora 24 with kernel 4.11.12-100.fc24.x86_64 I stopped having problems.

All of our CentOS 7, CentOS 8, Rocky Linux 8, and Rocky Linux 9 systems are OK, as are the Fedora 18 ones. I think those are all of the OS versions we are running with ZFS and pass-through disks.

The VM I'm having problems with has run for years under VMWare ESXi 5.5. I only had problems after migrating to Proxmox 8.2.2, so it has to be either an interaction between the guest OS and QEMU, or something between that and Proxmox itself.

The disk errors I saw appeared only in the virtual machine's logs, with nothing in the logs of the host server. Even though the virtual machine reported disk corruption, mounting the disks in another VM and scrubbing them revealed no corruption or errors.

Interestingly, one of the search results that came up when I was looking for similar issues was one I posted years ago when they discovered there was a memory issue on NUMA systems and fixed it.