Removing an osd after hitting the timeout fails

Question

peppepetra opened this issue 8 months ago

Giuseppe Petralia commented 8 months ago

I tried to remove an OSD, but the removal did not complete in time and I hit the timeout:

root@machine1:~# microceph disk remove osd.0
Removing osd.0, timeout 300s
Error: Failed to remove disk, timeout (300s) reached - abort

I waited for recovery to complete and then tried to remove it again, but this time it failed with:

root@machine1:~# microceph disk remove osd.0
Removing osd.0, timeout 300s
Error: Failed to remove disk: Failed to kill osd.0: Failed to run: pkill -f ceph-osd .* --id 0: exit status 1

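This is probably happening because the osd.0 process has already been killed, so pkill finds no matching process and exits with status 1. There is no ceph-osd process with --id 0 in the process list: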

root@machine1:~# ps aux | grep ceph-osd
root        7460  2.9  2.4 2171584 1616484 ?     Ssl  Oct06 119:45 ceph-osd --cluster ceph --id 3
root        9923  5.4  3.3 2731796 2186620 ?     Ssl  Oct06 218:57 ceph-osd --cluster ceph --id 4
root       11750  4.7  3.1 2647836 2065424 ?     Ssl  Oct06 190:21 ceph-osd --cluster ceph --id 8
root      143989  9.4  2.3 2354512 1515484 ?     Ssl  09:16   9:52 ceph-osd --cluster ceph --id 15
root     2034734  8.5  3.2 2869556 2158212 ?     Ssl  06:52  21:07 ceph-osd --cluster ceph --id 12
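To illustrate why a retry could tolerate this case: pkill(1) documents exit status 0 when at least one process matched, 1 when none matched, 2 for a syntax error, and 3 for a fatal error. The sketch below treats status 1 as "nothing left to kill" instead of a failure. Both function names are hypothetical and this is not how microceph implements removal; the pattern is taken from the error message above, with an added "( |$)" (my assumption) so that an id such as 1 does not also match 15.

```shell
#!/bin/sh
# Sketch only: pkill_status_ok and stop_osd_process are hypothetical
# helpers for illustration, not part of microceph.

# Decide whether a pkill exit status is acceptable for an idempotent stop.
# Per pkill(1): 0 = matched, 1 = no match, 2 = syntax error, 3 = fatal error.
pkill_status_ok() {
    case "$1" in
        0|1) return 0 ;;  # process was signalled, or it was already gone
        *)   return 1 ;;  # genuine failure, report it
    esac
}

# Stop the ceph-osd process for one OSD id, succeeding if it is already gone.
stop_osd_process() {
    osd_id="$1"
    status=0
    # Pattern from the error message; "( |$)" is an assumption added here
    # so that id 1 does not also match ids like 12 or 15.
    pkill -f "ceph-osd .* --id ${osd_id}( |\$)" || status=$?
    pkill_status_ok "$status"
}
```

With this, calling stop_osd_process 0 when no osd.0 process exists would return success, so a second removal attempt would not stop at the kill step.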

As a workaround, I can purge the OSD with:

ceph osd purge --yes-i-really-mean-it osd.0

After that, the removal works.

Peter Sabaini · Answer 1 · Tue Oct 10 2023 00:07:04 GMT+0800 (China Standard Time)

This should be fixed