digitalocean / ceph_exporter

Prometheus exporter that scrapes meta information about a ceph cluster.

Ceph OSD still present after being removed from the cluster

a-nldisr opened this issue · comments

When an OSD is removed from the cluster, it does not disappear from the exported metrics.

Version: 1.0.0

Example:

We removed osd.230 from the cluster after a complete node failure; the OSD therefore never came back after the failure. We removed the OSD, its keys, and its CRUSH entries. When we reload the exporter, the metrics for the removed OSD are no longer exported. However, when we query a node where we did not reload the exporter, we keep getting results like these:

ceph_osd_avail_bytes{osd="osd.230"} 4.252156604e+12
ceph_osd_bytes{osd="osd.230"} 5.858434628e+12
ceph_osd_crush_weight{osd="osd.230"} 5.456085
ceph_osd_depth{osd="osd.230"} 2
ceph_osd_in{osd="osd.230"} 0
ceph_osd_perf_apply_latency_seconds{osd="osd.230"} 0
ceph_osd_perf_commit_latency_seconds{osd="osd.230"} 0
ceph_osd_pgs{osd="osd.230"} 131
ceph_osd_reweight{osd="osd.230"} 1

ceph osd tree | grep 230 returns nothing.
ceph auth list | grep 230 returns nothing.

According to our audit logs, the OSD was removed from the cluster on 2017-05-22, more than 9 days ago.

We removed the OSD with:
ceph osd crush remove osd.{osd-num}
ceph auth del osd.{osd-num}
ceph osd rm {osd-num}
ceph osd crush remove {host}

Ceph version:
10.2.5

Restarting the exporter process clears the removed OSDs from the results.
We have 10 OSDs per node; all 10 OSDs removed from this node are still reported.

Currently we run one exporter per Ceph monitor node, so we have 5 exporters running.

This seems to be the same issue as #48, but we do not see the OSDs removed from the metrics even after a day.

Are those OSDs removed from the CRUSH map? If yes, an exporter restart usually helps.

We see a similar issue with renaming pools. After renaming a pool, ceph_exporter reports both the new pool name and the old one. Restarting the exporter fixes the problem.
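Both symptoms (removed OSDs and renamed pools lingering until a restart) are consistent with the exporter keeping metric label sets in process memory: a GaugeVec in the Prometheus Go client remembers every label combination it has ever been given until the vector is reset or the series is deleted, and a restart rebuilds the vectors from scratch. Below is a minimal sketch of that behaviour, using a hypothetical metric name rather than the exporter's actual code.

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical gauge for illustration only; not the exporter's real metric.
var osdAvail = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "example_osd_avail_bytes",
		Help: "Available bytes per OSD (illustrative).",
	},
	[]string{"osd"},
)

func init() {
	prometheus.MustRegister(osdAvail)
}

// collect simulates one scrape cycle: it sets values only for OSDs that
// still exist, but label sets written on earlier cycles stay inside the
// GaugeVec and keep being exported with their last value.
func collect(currentOSDs map[string]float64) {
	// Without this Reset() (or DeleteLabelValues for each removed OSD),
	// osd.230 would keep reporting its last value after removal, which is
	// what an exporter restart effectively works around.
	osdAvail.Reset()
	for osd, avail := range currentOSDs {
		osdAvail.WithLabelValues(osd).Set(avail)
	}
}

func main() {
	collect(map[string]float64{"osd.1": 4.2e12, "osd.230": 4.25e12})
	// osd.230 has been removed from the cluster; the next cycle no longer
	// sets it, and only the Reset() above prevents a stale series.
	collect(map[string]float64{"osd.1": 4.2e12})

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9128", nil)
}

The same reasoning would apply to pool-labelled metrics after a rename: the old pool name stays behind as a cached label value until the vector is reset or the process restarts.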

I have moved on and I am no longer using Ceph in my current position. Unsubscribing.


Closing this in favor of #102, which better describes the issue.