digitalocean / ceph_exporter

Prometheus exporter that scrapes meta information about a ceph cluster.

curl <host>:9128/metrics hangs

davecore82 opened this issue

I have two ceph clusters where I run ceph_exporter the same way. In cluster A, I can curl the host:9128/metrics and get the metrics, but in cluster B the curl command just hangs.

I think my question is more about how I can troubleshoot this. The two ceph clusters are healthy, and there are no down OSDs in ceph osd tree.

What would the sequence of steps be to troubleshoot why this curl hangs?

I also see this error in the logs for ceph_exporter:

2020/10/08 15:15:59 [ERROR] cannot extract total objects: strconv.ParseFloat: parsing "": invalid syntax
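For context, that message is exactly what Go's strconv.ParseFloat produces when it is handed an empty string, e.g. when the field the exporter expects is missing from the command output it parses. A minimal standalone reproduction (not taken from the exporter itself):

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	// An empty string reproduces the exact message seen in the exporter log:
	// strconv.ParseFloat: parsing "": invalid syntax
	_, err := strconv.ParseFloat("", 64)
	fmt.Println(err)
}
```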

I found the culprit in my environment. I am using a snap package of prometheus-ceph-exporter version 3.0.0-nautilus revision 21 from latest/edge, which adds a new ceph command to data collection. I found it in ceph.audit.log on the queried ceph-mon unit:

2020-10-15 13:46:22.110129 mon.<host> (mon.2) 528995 : audit [DBG] from='client.? v1:172.31.250.48:0/33662233' entity='client.prometheus-ceph-exporter-az1' cmd=[{"dumpcontents":["pgs_brief"],"format":"json","prefix":"pg dump"}]: dispatch

The ceph pg dump --format json command takes around 100 seconds to return in my 2 clusters. The plain format takes around 4 to 6 seconds. The ceph exporter wants the json format so it hangs for around 100 seconds waiting for the data. But the prometheus scrape job has a default timeout of 15 seconds, so the target will show as down with "Context Deadline Exceeded".
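To make the timeout interaction concrete, here is a rough sketch of how a 15-second deadline (Prometheus' default scrape_timeout) behaves against a command that needs around 100 seconds. The exporter itself talks to the mons through go-ceph rather than shelling out, so this is purely illustrative:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// 15s mirrors Prometheus' default scrape_timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	// A command that takes ~100s gets killed long before it finishes,
	// which surfaces in Prometheus as "context deadline exceeded".
	out, err := exec.CommandContext(ctx, "ceph", "pg", "dump", "--format", "json").Output()
	if ctx.Err() == context.DeadlineExceeded {
		fmt.Println("scrape would fail: context deadline exceeded")
		return
	}
	if err != nil {
		fmt.Println("ceph command failed:", err)
		return
	}
	fmt.Printf("got %d bytes of pg dump output\n", len(out))
}
```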

I reverted the snap package back to prometheus-ceph-exporter version 2.0.0 revision 20 from latest/stable for now. The /metrics target scrape now returns in 227ms in prometheus.

But now I wonder: am I the only one seeing a long-running ceph pg dump --format json command? This will probably break many prometheus scrape jobs if it takes 100 seconds to run when the default scrape timeout is 15 seconds.

Does anyone else see this?

@davecore82 We only run ceph pg dump pgs_brief --format json as far as I recall, which should be faster than a full pg dump. Can you let me know how many PGs and OSDs you have in your cluster? I've never seen that command take that long to return.

I have 336 OSDs and 18947 PGs.

It takes about 1 second to run ceph pg dump pgs_brief --format json and about 60 seconds to run ceph pg dump --format json.

Thanks! I'll try to have a look at this, but it may take a bit.

(cc @yuezhu since he's currently working on an issue that feels very similar)

@davecore82 We just upgraded the go-ceph binding and made the connection short-lived by creating a connection on demand for each mon command. Would you please build a docker image with the latest nautilus branch via docker build -t digitalocean/ceph_exporter . --no-cache, and see if the hang still exists?
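For anyone following along, here is a rough sketch of what "a connection on demand for each mon command" looks like with the go-ceph rados binding. This is illustrative only, not the exporter's actual code, and it assumes a default ceph.conf and keyring are readable by the process:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/ceph/go-ceph/rados"
)

// runMonCommand opens a rados connection, issues a single mon command,
// and shuts the connection down again, so no long-lived connection is held.
func runMonCommand(cmd map[string]interface{}) ([]byte, error) {
	conn, err := rados.NewConn()
	if err != nil {
		return nil, err
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		return nil, err
	}
	if err := conn.Connect(); err != nil {
		return nil, err
	}
	// The connection lives only for the duration of this one command.
	defer conn.Shutdown()

	args, err := json.Marshal(cmd)
	if err != nil {
		return nil, err
	}
	buf, _, err := conn.MonCommand(args)
	return buf, err
}

func main() {
	// Same command shape as seen in ceph.audit.log above.
	out, err := runMonCommand(map[string]interface{}{
		"prefix":       "pg dump",
		"dumpcontents": []string{"pgs_brief"},
		"format":       "json",
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("received %d bytes\n", len(out))
}
```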

Hey @davecore82, we (DO) were also seeing symptoms like what you describe in our environment, and we believe that @yuezhu's #184 should address the issue; it's looking good so far for us. We've also filed https://tracker.ceph.com/issues/48052 upstream, as we believe this may represent an issue with request forwarding in Nautilus.