canonical / microceph

Ceph for a one-rack cluster and appliances

Home Page: https://snapcraft.io/microceph


`microceph disk add` sometimes fails but returns 0

simondeziel opened this issue · comments

Issue report

In a CI script set to abort on errors (`set -e`), `microceph disk add --wipe` encountered an error but did not return a non-zero exit code:

+ sudo microceph disk add --wipe /dev/sdb

+----------+---------+
|   PATH   | STATUS  |
+----------+---------+
| /dev/sdb | Failure |
+----------+---------+
Error:  failed to bootstrap OSD: Failed to run: ceph-osd --mkfs --no-mon-config -i 1: exit status 250 (2024-01-29T22:30:14.664+0000 7f1570e998c0 -1 bluestore(/var/lib/ceph/osd/ceph-1/block) _read_bdev_label unable to decode label at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input
2024-01-29T22:30:14.664+0000 7f1570e998c0 -1 bluestore(/var/lib/ceph/osd/ceph-1/block) _read_bdev_label unable to decode label at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input
2024-01-29T22:30:14.664+0000 7f1570e998c0 -1 bluestore(/var/lib/ceph/osd/ceph-1/block) _read_bdev_label unable to decode label at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input
2024-01-29T22:30:14.676+0000 7f1570e998c0 -1 bdev(0x563a03578000 /var/lib/ceph/osd/ceph-1/block) open open got: (16) Device or resource busy
2024-01-29T22:30:14.676+0000 7f1570e998c0 -1 bluestore(/var/lib/ceph/osd/ceph-1) mkfs failed, (16) Device or resource busy
2024-01-29T22:30:14.676+0000 7f1570e998c0 -1 OSD::mkfs: ObjectStore::mkfs failed with error (16) Device or resource busy
2024-01-29T22:30:14.676+0000 7f1570e998c0 -1  ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-1: (16) Device or resource busy)
+ sudo rm -rf /etc/ceph
+ sudo ln -s /var/snap/microceph/current/conf/ /etc/ceph
...
+ sudo microceph.ceph status
  cluster:
    id:     594f8038-eb9d-4381-9707-4a622a23fd97
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim flag(s) set
            Reduced data availability: 65 pgs inactive
            3 pool(s) have no replicas configured
            OSD count 0 < osd_pool_default_size 1
 
  services:
    mon: 1 daemons, quorum fv-az665-985 (age 2m)
    mgr: fv-az665-985(active, since 2m)
    mds: 1/1 daemons up
    osd: 0 osds: 0 up, 0 in
         flags nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim
 
  data:
    volumes: 1/1 healthy
    pools:   3 pools, 65 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             65 unknown
...
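Because the command exits 0 even when its status table reports `Failure`, a shell guard such as `set -e` cannot catch the problem. Until the exit code is fixed, one possible workaround is to scan the command's output for `Failure`; the following sketch does that (`check_disk_add_output` is a hypothetical helper, not part of MicroCeph):

```shell
# check_disk_add_output: return non-zero when the "microceph disk add"
# status table reports "Failure", since the command's own exit code
# cannot be trusted here. Illustrative workaround, not MicroCeph code.
check_disk_add_output() {
    if printf '%s\n' "$1" | grep -q 'Failure'; then
        return 1
    fi
    return 0
}
```

In the CI script this could be used as `out="$(sudo microceph disk add --wipe /dev/sdb)"; check_disk_add_output "${out}"`, which does abort under `set -e` when the table shows a failure.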

What version of MicroCeph are you using?

$ sudo snap install microceph --edge
microceph (reef/edge) 18.2.0+snape56a71f5dd from Canonical✓ installed

What are the steps to reproduce this issue?

https://github.com/canonical/lxd/actions/runs/7703310904/workflow?pr=12783#L270-L319 has it all but essentially:

  1. sudo snap install microceph --edge
  2. sudo swapoff /mnt/swapfile
  3. sudo umount /mnt # unmount the ephemeral disk of the GitHub Actions runner
  4. sudo microceph disk add --wipe "${ephemeral_disk}" # try to give the ephemeral disk to microceph

What happens (observed behaviour)?

`microceph disk add --wipe` returned 0 despite running into errors.

What were you expecting to happen?

`microceph disk add --wipe` should return a non-zero exit code on error, so that CI scripts running under `set -e` abort at the failing step.
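The expected contract can be illustrated with two stand-in shell functions; neither invokes MicroCeph, they only mimic the observed (buggy) and the expected (fixed) exit-code behaviour:

```shell
# Stand-in for the buggy behaviour: prints a Failure row in the status
# table but still exits 0, so "set -e" callers do not abort.
buggy_disk_add() {
    echo "| /dev/sdb | Failure |"
    return 0    # bug: success exit code despite the failure
}

# Expected behaviour: same output, but the failure is propagated as a
# non-zero exit code, so "set -e" callers abort at this step.
fixed_disk_add() {
    echo "| /dev/sdb | Failure |"
    return 1
}
```

With the fixed behaviour, `sudo microceph disk add --wipe "${ephemeral_disk}"` in the CI script above would have stopped the job at the failing step instead of continuing.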

Relevant logs, error output, etc.

https://github.com/canonical/lxd/actions/runs/7703310904/job/20993429312?pr=12783#step:10:328

Thanks a lot for reporting this bug @simondeziel. This was fixed by #291.
Marking this issue closed.

@UtkarshBhatthere many thanks for the quick turnaround!