Removing an osd after hitting the timeout fails
peppepetra opened this issue
Giuseppe Petralia commented
I tried to remove an OSD, but the operation did not complete in time and hit the timeout:
root@machine1:~# microceph disk remove osd.0
Removing osd.0, timeout 300s
Error: Failed to remove disk, timeout (300s) reached - abort
I waited for recovery to complete and then tried the removal again, but this time it failed with:
root@machine1:~# microceph disk remove osd.0
Removing osd.0, timeout 300s
Error: Failed to remove disk: Failed to kill osd.0: Failed to run: pkill -f ceph-osd .* --id 0: exit status 1
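This is probably happening because the osd.0 process has already been killed, so pkill finds nothing to match and exits with status 1. No ceph-osd process with --id 0 is running: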
root@machine1:~# ps aux | grep ceph-osd
root 7460 2.9 2.4 2171584 1616484 ? Ssl Oct06 119:45 ceph-osd --cluster ceph --id 3
root 9923 5.4 3.3 2731796 2186620 ? Ssl Oct06 218:57 ceph-osd --cluster ceph --id 4
root 11750 4.7 3.1 2647836 2065424 ? Ssl Oct06 190:21 ceph-osd --cluster ceph --id 8
root 143989 9.4 2.3 2354512 1515484 ? Ssl 09:16 9:52 ceph-osd --cluster ceph --id 15
root 2034734 8.5 3.2 2869556 2158212 ? Ssl 06:52 21:07 ceph-osd --cluster ceph --id 12
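The failure mode above can be sketched in shell. pkill exits with status 1 when no process matched the pattern (or none could be signalled), so a retry after the OSD has already stopped reports an error even though there is nothing left to kill. The helper names `pkill_status_ok` and `stop_osd` are hypothetical and not part of MicroCeph, and the trailing `$` anchor assumes `--id N` is the last argument, as in the process list above. This is only an illustration of treating "no match" as success, not necessarily how the project handles it.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of MicroCeph.

# Succeeds when a pkill exit status means "nothing left to kill":
#   0 = at least one process matched and was signalled
#   1 = no process matched (or none could be signalled)
# Any other status (2 = syntax error, 3 = fatal error) is a real failure.
pkill_status_ok() {
    case "$1" in
        0|1) return 0 ;;
        *)   return 1 ;;
    esac
}

# Stop the ceph-osd process with the given numeric id, tolerating the
# case where it is no longer running. The "$" anchor is an assumption
# that keeps id 1 from also matching ids such as 12 or 15.
stop_osd() {
    rc=0
    pkill -f "ceph-osd .* --id $1\$" || rc=$?
    pkill_status_ok "$rc"
}
```

With a check like this, `stop_osd 0` would return success in the situation shown above, so a second removal attempt would not abort only because the process is already gone.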
As a workaround, I can purge the OSD with:
ceph osd purge --yes-i-really-mean-it osd.0
and after that the removal succeeds.
Peter Sabaini commented
This should be fixed