canonical / microceph

Ceph for a one-rack cluster and appliances

Home Page: https://snapcraft.io/microceph


Microceph no longer tracking correct disks/osd

FLeiXiuS opened this issue

Randomly, one of my OSDs became unavailable and was marked down. I immediately started troubleshooting and noticed that the microceph disk list command showed the device that was originally osd.8 as "available unpartitioned."

OSD.8 is the disk that is currently marked as down.

[screenshot]

OSD.8 and OSD.9 are somehow set to the exact same disk. I'm not sure how this happened, as they were both added using their scsi-XXX names.
[screenshot]
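A quick way to confirm this symptom is to check whether any backing device is claimed by more than one OSD. The sketch below uses illustrative sample data (the osd names and scsi- paths are made up); on a real cluster you would derive the osd/path pairs from your disk listing, whose exact fields vary by version:

```shell
#!/bin/sh
# Sketch: spot OSDs that resolve to the same backing device.
# The osd/path pairs below are illustrative sample data only.
cat <<'EOF' > /tmp/osd_paths.txt
osd.8 /dev/disk/by-id/scsi-35000c500a1b2c3d4
osd.9 /dev/disk/by-id/scsi-35000c500a1b2c3d4
osd.10 /dev/disk/by-id/scsi-35000c500deadbeef
EOF
# Any device path printed here is claimed by multiple OSDs.
awk '{print $2}' /tmp/osd_paths.txt | sort | uniq -d
```

With the sample data above, the duplicated scsi- path is printed once, mirroring the osd.8/osd.9 situation described in the report.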

The cluster is currently very unhappy because OSD.8 is down/out.
[screenshot]

I'm not sure how to proceed, since I cannot remove the disk with microceph disk remove OSD.8 when that same disk is also OSD.9; the command times out when I attempt it.

OSD.9 is currently available and online.
[screenshot]

In the Ceph dashboard, OSD.8 is correctly mapped to the right disk.
[screenshot]

Version:
[screenshot]

The microceph command is confused about which device is which OSD. Any suggestions?

The disk path is stored in the internal dqlite cluster database (it can be checked with sudo microceph cluster sql "select * from disks"). Corruption of that entry is fairly unlikely. Can you tell us a bit more about your environment, or about any recent occurrence or operation that could be the culprit?
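Since the disks table stores a device path, one thing worth checking is whether the stable /dev/disk/by-id name still resolves to the device you expect; with USB enclosures, the kernel name behind such a symlink can change after a bus reset. The sketch below demonstrates the resolution step on a throwaway symlink (all paths and the scsi- name are illustrative, not from a real host):

```shell
#!/bin/sh
# Sketch: check what a stable by-id name currently points at.
# Uses a throwaway symlink so it runs anywhere; paths are illustrative.
mkdir -p /tmp/by-id
ln -sf ../sdx /tmp/by-id/scsi-35000c500a1b2c3d4
# Print the kernel device the stable name resolves to right now.
readlink /tmp/by-id/scsi-35000c500a1b2c3d4
```

On a real host you would compare readlink -f /dev/disk/by-id/scsi-... against the path recorded by sudo microceph cluster sql "select * from disks" (the command quoted in this thread; the table's column layout is not documented here, so inspect it before drawing conclusions).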

Sure, this is a homelab for testing microceph. I'm using three USB 3.2 DAS storage devices for a 3-node cluster, each with 4 drives, for 12 OSDs total.

[screenshot]