`microceph disk add` sometimes fails but returns 0
simondeziel opened this issue · comments
Simon Deziel commented
Issue report
In a CI script set to abort on errors, microceph disk add --wipe
encountered an error but didn't return a non-zero exit code:
+ sudo microceph disk add --wipe /dev/sdb
+----------+---------+
| PATH | STATUS |
+----------+---------+
| /dev/sdb | Failure |
+----------+---------+
Error: failed to bootstrap OSD: Failed to run: ceph-osd --mkfs --no-mon-config -i 1: exit status 250 (2024-01-29T22:30:14.664+0000 7f1570e998c0 -1 bluestore(/var/lib/ceph/osd/ceph-1/block) _read_bdev_label unable to decode label at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input
2024-01-29T22:30:14.664+0000 7f1570e998c0 -1 bluestore(/var/lib/ceph/osd/ceph-1/block) _read_bdev_label unable to decode label at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input
2024-01-29T22:30:14.664+0000 7f1570e998c0 -1 bluestore(/var/lib/ceph/osd/ceph-1/block) _read_bdev_label unable to decode label at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input
2024-01-29T22:30:14.676+0000 7f1570e998c0 -1 bdev(0x563a03578000 /var/lib/ceph/osd/ceph-1/block) open open got: (16) Device or resource busy
2024-01-29T22:30:14.676+0000 7f1570e998c0 -1 bluestore(/var/lib/ceph/osd/ceph-1) mkfs failed, (16) Device or resource busy
2024-01-29T22:30:14.676+0000 7f1570e998c0 -1 OSD::mkfs: ObjectStore::mkfs failed with error (16) Device or resource busy
2024-01-29T22:30:14.676+0000 7f1570e998c0 -1 ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-1: (16) Device or resource busy)
+ sudo rm -rf /etc/ceph
+ sudo ln -s /var/snap/microceph/current/conf/ /etc/ceph
...
+ sudo microceph.ceph status
  cluster:
    id:     594f8038-eb9d-4381-9707-4a622a23fd97
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim flag(s) set
            Reduced data availability: 65 pgs inactive
            3 pool(s) have no replicas configured
            OSD count 0 < osd_pool_default_size 1

  services:
    mon: 1 daemons, quorum fv-az665-985 (age 2m)
    mgr: fv-az665-985(active, since 2m)
    mds: 1/1 daemons up
    osd: 0 osds: 0 up, 0 in
         flags nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 65 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             65 unknown
...
What version of MicroCeph are you using ?
$ sudo snap install microceph --edge
microceph (reef/edge) 18.2.0+snape56a71f5dd from Canonical✓ installed
What are the steps to reproduce this issue ?
https://github.com/canonical/lxd/actions/runs/7703310904/workflow?pr=12783#L270-L319 has it all but essentially:
sudo snap install microceph --edge
sudo swapoff /mnt/swapfile
sudo umount /mnt  # umount the ephemeral disk of GitHub Action runners
sudo microceph disk add --wipe "${ephemeral_disk}"  # try to give the ephemeral disk to microceph
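For context on why the CI job ran past the failure: with set -e the shell aborts only when a command exits non-zero, so a command that prints an error yet still exits 0 is treated as success. A minimal sketch of that behaviour (simulated_disk_add is a stand-in I made up for the buggy command, not a real microceph call):

```shell
set -e  # abort on any non-zero exit status

# Stand-in for the buggy `microceph disk add --wipe`: it reports an error
# on stderr but still returns 0, so `set -e` never triggers.
simulated_disk_add() {
    echo "Error: failed to bootstrap OSD" >&2
    return 0
}

simulated_disk_add
echo "script continued past the failure"
```

This mirrors the CI trace above: the rm/ln/status steps after the failed disk add all ran because the shell saw exit status 0.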
What happens (observed behaviour) ?
microceph disk add --wipe returned 0 despite running into errors.
What were you expecting to happen ?
microceph disk add --wipe should exit with a non-zero status on error.
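Until a fixed snap is available, a CI script could guard against this by inspecting the status table itself, since a failed disk shows up as a "Failure" row. A hypothetical helper (check_disk_add_output is my name, not a microceph command) sketching that check:

```shell
# check_disk_add_output OUTPUT
# Returns non-zero if the captured `microceph disk add` status table
# contains a "Failure" row. Hypothetical workaround helper, not part
# of microceph itself.
check_disk_add_output() {
    ! printf '%s\n' "$1" | grep -q '| Failure |'
}

# Usage sketch, matching the invocation from the report:
#   out="$(sudo microceph disk add --wipe /dev/sdb 2>&1)"
#   echo "$out"
#   check_disk_add_output "$out" || exit 1
```

This only parses the human-readable table, so it is brittle against output-format changes; the real fix is for the command to propagate the error in its exit status.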
Relevant logs, error output, etc.
https://github.com/canonical/lxd/actions/runs/7703310904/job/20993429312?pr=12783#step:10:328
Utkarsh Bhatt commented
Thanks a lot for reporting this bug @simondeziel. This was fixed by #291
Marking this issue closed.
Simon Deziel commented
@UtkarshBhatthere many thanks for the quick turnaround!