Exporter fails when some OSDs are not online (luminous)
goebi opened this issue
The ceph_exporter from the luminous-2.0.0 branch fails when some OSDs are not online; it seems there are duplicate values. The Ceph release is 12.2.7, and we are using the official Docker container with the luminous-2.0.0 tag.
An error has occurred during metrics gathering:
4 error(s) occurred:
- collected metric ceph_osd_down label:<name:"cluster" value:"ceph" > label:<name:"osd" value:"osd.40" > label:<name:"status" value:"down" > gauge:<value:1 > was collected before with the same name and label values
- collected metric ceph_osd_down label:<name:"cluster" value:"ceph" > label:<name:"osd" value:"osd.35" > label:<name:"status" value:"down" > gauge:<value:1 > was collected before with the same name and label values
- collected metric ceph_osd_down label:<name:"cluster" value:"ceph" > label:<name:"osd" value:"osd.36" > label:<name:"status" value:"down" > gauge:<value:1 > was collected before with the same name and label values
- collected metric ceph_osd_down label:<name:"cluster" value:"ceph" > label:<name:"osd" value:"osd.41" > label:<name:"status" value:"down" > gauge:<value:1 > was collected before with the same name and label values
When all OSDs are online, everything works as expected.
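For context, this error is raised by the Prometheus Go client's registry, which rejects two samples with an identical metric name and label set in a single scrape. Below is a minimal sketch (not the exporter's actual code) that reproduces the same failure mode by emitting one OSD's gauge twice, as an exporter would if the OSD appeared in two places in the cluster tree:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// dupCollector sends the same gauge twice with identical label values,
// mimicking an exporter that reports one OSD from two places.
type dupCollector struct {
	desc *prometheus.Desc
}

func (c *dupCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *dupCollector) Collect(ch chan<- prometheus.Metric) {
	for i := 0; i < 2; i++ {
		ch <- prometheus.MustNewConstMetric(
			c.desc, prometheus.GaugeValue, 1,
			"ceph", "osd.40", "down",
		)
	}
}

func main() {
	desc := prometheus.NewDesc(
		"ceph_osd_down", "Whether an OSD is down",
		[]string{"cluster", "osd", "status"}, nil,
	)
	reg := prometheus.NewRegistry()
	reg.MustRegister(&dupCollector{desc: desc})

	// Gather fails with "was collected before with the same name and
	// label values" -- the same error class shown in the report above.
	if _, err := reg.Gather(); err != nil {
		fmt.Println(err)
	}
}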
Any news here? This is a blocker: the luminous-2.0.0 tag is currently unusable because of it.
Hi, would it be possible to share the output of ceph osd tree down -f json-pretty on your cluster? I have a hunch that OSDs in the nodes and stray arrays might have overlapped.
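If that hunch is right, one way to avoid the duplicates would be to track which OSD IDs have already been reported while walking both arrays. A hypothetical Go sketch follows; the osdNode and osdTree structs mirror the nodes/stray keys of the ceph osd tree JSON output, but this is not the exporter's actual code:

package main

import (
	"encoding/json"
	"fmt"
)

// osdNode holds the fields of one entry in the ceph osd tree JSON.
type osdNode struct {
	ID     int    `json:"id"`
	Name   string `json:"name"`
	Type   string `json:"type"`
	Status string `json:"status"`
}

// osdTree covers the two arrays the maintainer suspects overlap.
type osdTree struct {
	Nodes []osdNode `json:"nodes"`
	Stray []osdNode `json:"stray"`
}

func main() {
	// Sample input where the same OSD appears in both arrays.
	raw := []byte(`{
		"nodes": [{"id": 40, "name": "osd.40", "type": "osd", "status": "down"}],
		"stray": [{"id": 40, "name": "osd.40", "type": "osd", "status": "down"}]
	}`)

	var tree osdTree
	if err := json.Unmarshal(raw, &tree); err != nil {
		panic(err)
	}

	// Deduplicate by OSD ID so each OSD is emitted at most once.
	seen := make(map[int]bool)
	for _, n := range append(tree.Nodes, tree.Stray...) {
		if n.Type != "osd" || seen[n.ID] {
			continue // skip buckets and OSDs already reported
		}
		seen[n.ID] = true
		fmt.Printf("ceph_osd_down{osd=%q,status=%q} 1\n", n.Name, n.Status)
	}
}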
We have more than 30 clusters, so here is the output from just one of them... It seems there are duplicate values.
- collected metric ceph_osd_down label:<name:"cluster" value:"xxxx-region-1" > label:<name:"osd" value:"osd.56" > label:<name:"status" value:"up" > gauge:<value:1 > was collected before with the same name and label values
- collected metric ceph_osd_down label:<name:"cluster" value:"xxxx-region-1" > label:<name:"osd" value:"osd.169" > label:<name:"status" value:"up" > gauge:<value:1 > was collected before with the same name and label values
- collected metric ceph_osd_down label:<name:"cluster" value:"xxxx-region-1" > label:<name:"osd" value:"osd.45" > label:<name:"status" value:"up" > gauge:<value:1 > was collected before with the same name and label values
- collected metric ceph_osd_down label:<name:"cluster" value:"xxxx-region-1" > label:<name:"osd" value:"osd.158" > label:<name:"status" value:"up" > gauge:<value:1 > was collected before with the same name and label values
- collected metric ceph_osd_down label:<name:"cluster" value:"xxxx-region-1" > label:<name:"osd" value:"osd.49" > label:<name:"status" value:"up" > gauge:<value:1 > was collected before with the same name and label values
Hi, would it be possible to share the output of ceph osd tree down -f json-pretty on your cluster? I have a hunch that OSDs in the nodes and stray arrays might have overlapped.
How do I resolve this problem? It was caused by using the command "ceph osd crush link xxxx", which (unlike crush move) places an entry under an additional CRUSH location without removing it from its original one, so its OSDs can appear more than once in the tree output.
Hi, this shouldn't be an issue anymore with our latest nautilus-based release. Please re-open if you experience it again.