canonical / microceph

Ceph for a one-rack cluster and appliances

Home Page: https://snapcraft.io/microceph


Microceph no longer tracking correct disks/osd

FLeiXiuS opened this issue

Randomly, one of my OSDs became unavailable and was marked down. I immediately started troubleshooting and noticed that the microceph disk list command showed the device that was originally osd.8 as "available unpartitioned."

OSD.8 is the disk that is currently marked as down.

[screenshot]

OSD.8 and OSD.9 are somehow set to the exact same disk. I'm not sure how this happened, as they were both added using their scsi-XXX names.
[screenshot]
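A quick way to confirm this symptom is to check whether any backing device is claimed by more than one OSD. The sketch below uses illustrative sample data (the osd names and scsi- paths are made up); on a real cluster you would derive the osd/path pairs from your disk listing, whose exact fields vary by version:

```shell
#!/bin/sh
# Sketch: spot OSDs that resolve to the same backing device.
# The osd/path pairs below are illustrative sample data only.
cat <<'EOF' > /tmp/osd_paths.txt
osd.8 /dev/disk/by-id/scsi-35000c500a1b2c3d4
osd.9 /dev/disk/by-id/scsi-35000c500a1b2c3d4
osd.10 /dev/disk/by-id/scsi-35000c500deadbeef
EOF
# Any device path printed here is claimed by multiple OSDs.
awk '{print $2}' /tmp/osd_paths.txt | sort | uniq -d
```

With the sample data above, the duplicated scsi- path is printed once, mirroring the osd.8/osd.9 situation described in the report.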

The cluster is currently very unhappy because OSD.8 is down/out.
[screenshot]

I'm not sure how to proceed, since I cannot remove the disk with microceph disk remove OSD.8 when that same disk is also OSD.9; the command times out when I attempt it.

OSD.9 is currently available and online.
[screenshot]

In the Ceph dashboard, OSD.8 is correctly mapped to the right disk.
[screenshot]

Version:
[screenshot]

The microceph command is confused about which device is which OSD. Any suggestions?

The disk path is stored in the internal dqlite cluster database (it can be checked with sudo microceph cluster sql "select * from disks"). Corruption of that entry is fairly unlikely. Can you tell us a bit more about your environment, or about any recent occurrence or operation that could be the culprit?
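Since the disks table stores a device path, one thing worth checking is whether the stable /dev/disk/by-id name still resolves to the device you expect; with USB enclosures, the kernel name behind such a symlink can change after a bus reset. The sketch below demonstrates the resolution step on a throwaway symlink (all paths and the scsi- name are illustrative, not from a real host):

```shell
#!/bin/sh
# Sketch: check what a stable by-id name currently points at.
# Uses a throwaway symlink so it runs anywhere; paths are illustrative.
mkdir -p /tmp/by-id
ln -sf ../sdx /tmp/by-id/scsi-35000c500a1b2c3d4
# Print the kernel device the stable name resolves to right now.
readlink /tmp/by-id/scsi-35000c500a1b2c3d4
```

On a real host you would compare readlink -f /dev/disk/by-id/scsi-... against the path recorded by sudo microceph cluster sql "select * from disks" (the command quoted in this thread; the table's column layout is not documented here, so inspect it before drawing conclusions).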

Sure, this is a homelab for testing microceph. I'm using three USB 3.2 DAS storage devices for a 3-node cluster, each with 4 drives, for 12 OSDs total.

[screenshot]