MicroCeph no longer tracking the correct disks/OSDs
FLeiXiuS opened this issue
One of my OSDs unexpectedly became unavailable and went down. I immediately started troubleshooting and noticed that the `microceph disk list` command showed the device originally backing osd.8 as "available unpartitioned."
osd.8 is the OSD currently marked as down.
Somehow, osd.8 and osd.9 are both mapped to the exact same disk. I'm not sure how this happened, since both were added using their `scsi-XXX` names.
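One way to confirm that two OSDs really point at the same device is to resolve each recorded path to its kernel node. The Python sketch below assumes you have copied the OSD numbers and paths MicroCeph reports into a dict by hand; `find_shared_devices` is a hypothetical helper, not part of MicroCeph. It relies only on `os.path.realpath`, which follows `/dev/disk/by-id/scsi-...` symlinks down to nodes such as `/dev/sdb`.

```python
import os
from collections import defaultdict


def find_shared_devices(osd_paths):
    """Report block devices claimed by more than one OSD.

    osd_paths: dict mapping OSD id (int) -> recorded disk path, e.g. a
    /dev/disk/by-id/scsi-... symlink (hypothetical input copied by hand).
    Returns {resolved_device: sorted list of OSD ids} for shared devices only.
    """
    by_device = defaultdict(list)
    for osd, path in osd_paths.items():
        # realpath follows the by-id symlink chain to the real device node
        by_device[os.path.realpath(path)].append(osd)
    # Keep only devices that more than one OSD resolves to
    return {dev: sorted(osds) for dev, osds in by_device.items() if len(osds) > 1}
```

Any device listed with two or more OSD ids (for example `[8, 9]`) confirms the duplicate mapping. Run it on the host that owns the disks, since the symlinks only resolve there.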
The cluster is currently unhealthy because osd.8 is down/out.
I'm not sure how to proceed: I cannot remove the disk with `microceph disk remove OSD.8`, presumably because that same disk is also mapped to osd.9. The command times out when I attempt it.
osd.9 is currently available and online.
The Ceph dashboard correctly shows osd.8 on the right disk.
The microceph command, however, is confused about which device belongs to which OSD. Any suggestions?
The disk path is stored in the internal dqlite database (you can check it with `sudo microceph cluster sql "select * from disks"`). Corruption of that entry is relatively unlikely. Can you tell us more about your environment, or about any recent event or operation that could be responsible?
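To check whether the stored path still matches the disk Ceph itself is using, it may help to compare it with the OSD's metadata. This is a sketch under several assumptions: that `ceph osd metadata <id> --format json` returns a comma-separated `devices` field (e.g. `"sdb"`), that the OSD sits on a whole raw disk rather than LVM/device-mapper, and that the client is callable as `ceph` (on MicroCeph it may be `microceph.ceph`). All helper names are hypothetical.

```python
import json
import os
import subprocess


def ceph_devices(metadata_json):
    """Return the kernel device names Ceph lists for an OSD.

    Assumes a "devices" field such as "sdb" or "sdb,sdc" in the metadata JSON.
    """
    meta = json.loads(metadata_json)
    return [d for d in meta.get("devices", "").split(",") if d]


def path_matches_ceph(recorded_path, metadata_json):
    """True if the recorded path resolves to a device Ceph lists for the OSD."""
    # e.g. /dev/disk/by-id/scsi-... -> /dev/sdb -> "sdb"
    kernel_name = os.path.basename(os.path.realpath(recorded_path))
    return kernel_name in ceph_devices(metadata_json)


def fetch_metadata(osd_id):
    """Run `ceph osd metadata` for one OSD and return its JSON output."""
    # Swap "ceph" for "microceph.ceph" if that is how the client is installed.
    return subprocess.run(
        ["ceph", "osd", "metadata", str(osd_id), "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
```

For example, `path_matches_ceph(path_for_osd8, fetch_metadata(8))` returning `False` would suggest that the path stored for osd.8 no longer resolves to the disk Ceph is using for it.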