digitalocean / ceph_exporter

Prometheus exporter that scrapes meta information about a ceph cluster.

curl <host>:9128/metrics hangs

davecore82 opened this issue

I have two ceph clusters where I run ceph_exporter the same way. In cluster A, I can curl the host:9128/metrics and get the metrics, but in cluster B the curl command just hangs.

I think my question is more about how I can troubleshoot this. The two ceph clusters are healthy, and there are no down OSDs in ceph osd tree.

What would the sequence of steps be to troubleshoot why this curl hangs?

I also see this error in the logs for ceph_exporter:

2020/10/08 15:15:59 [ERROR] cannot extract total objects: strconv.ParseFloat: parsing "": invalid syntax
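For context, that message is exactly what Go's strconv.ParseFloat produces when it is handed an empty string, e.g. when the field the exporter expects is missing from the command output it parses. A minimal standalone reproduction (not taken from the exporter itself):

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	// An empty string reproduces the exact message seen in the exporter log:
	// strconv.ParseFloat: parsing "": invalid syntax
	_, err := strconv.ParseFloat("", 64)
	fmt.Println(err)
}
```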

I found the culprit in my environment. I am using a snap package of prometheus-ceph-exporter version 3.0.0-nautilus revision 21 from latest/edge, which adds a new ceph command to data collection. I found it in ceph.audit.log on the queried ceph-mon unit:

2020-10-15 13:46:22.110129 mon.<host> (mon.2) 528995 : audit [DBG] from='client.? v1:172.31.250.48:0/33662233' entity='client.prometheus-ceph-exporter-az1' cmd=[{"dumpcontents":["pgs_brief"],"format":"json","prefix":"pg dump"}]: dispatch

The ceph pg dump --format json command takes around 100 seconds to return in my 2 clusters. The plain format takes around 4 to 6 seconds. The ceph exporter wants the json format so it hangs for around 100 seconds waiting for the data. But the prometheus scrape job has a default timeout of 15 seconds, so the target will show as down with "Context Deadline Exceeded".
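To make the timeout interaction concrete, here is a rough sketch of how a 15-second deadline (Prometheus' default scrape_timeout) behaves against a command that needs around 100 seconds. The exporter itself talks to the mons through go-ceph rather than shelling out, so this is purely illustrative:

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// 15s mirrors Prometheus' default scrape_timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	// A command that takes ~100s gets killed long before it finishes,
	// which surfaces in Prometheus as "context deadline exceeded".
	out, err := exec.CommandContext(ctx, "ceph", "pg", "dump", "--format", "json").Output()
	if ctx.Err() == context.DeadlineExceeded {
		fmt.Println("scrape would fail: context deadline exceeded")
		return
	}
	if err != nil {
		fmt.Println("ceph command failed:", err)
		return
	}
	fmt.Printf("got %d bytes of pg dump output\n", len(out))
}
```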

I reverted the snap package back to prometheus-ceph-exporter version 2.0.0 revision 20 from latest/stable for now. The /metrics target scrape now returns in 227ms in prometheus.

But now I wonder: am I the only one seeing a long-running ceph pg dump --format json command? This will probably break many prometheus scrape jobs if it takes 100 seconds to run when the default scrape timeout is 15 seconds.

Does anyone else see this?

@davecore82 We only run ceph pg dump pgs_brief --format json as far as I recall, which should be faster than a full pg dump. Can you let me know how many PGs and OSDs you have in your cluster? I've never seen that command take that long to return.

I have 336 OSDs and 18947 PGs.

It takes about 1 second to run ceph pg dump pgs_brief --format json and about 60 seconds to run ceph pg dump --format json.

Thanks! I'll try to have a look at this, but it may take a bit.

(cc @yuezhu since he's currently working on an issue that feels very similar)

@davecore82 We just upgraded the go-ceph binding and made the connection short-lived by creating a connection on demand for each mon command. Would you please build a docker image with the latest nautilus branch via docker build -t digitalocean/ceph_exporter . --no-cache, and see if the hang still exists?
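For anyone following along, here is a rough sketch of what "a connection on demand for each mon command" looks like with the go-ceph rados binding. This is illustrative only, not the exporter's actual code, and it assumes a default ceph.conf and keyring are readable by the process:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/ceph/go-ceph/rados"
)

// runMonCommand opens a rados connection, issues a single mon command,
// and shuts the connection down again, so no long-lived connection is held.
func runMonCommand(cmd map[string]interface{}) ([]byte, error) {
	conn, err := rados.NewConn()
	if err != nil {
		return nil, err
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		return nil, err
	}
	if err := conn.Connect(); err != nil {
		return nil, err
	}
	// The connection lives only for the duration of this one command.
	defer conn.Shutdown()

	args, err := json.Marshal(cmd)
	if err != nil {
		return nil, err
	}
	buf, _, err := conn.MonCommand(args)
	return buf, err
}

func main() {
	// Same command shape as seen in ceph.audit.log above.
	out, err := runMonCommand(map[string]interface{}{
		"prefix":       "pg dump",
		"dumpcontents": []string{"pgs_brief"},
		"format":       "json",
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("received %d bytes\n", len(out))
}
```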

Hey @davecore82, we (DO) were also seeing symptoms like what you describe in our environment, and we believe that @yuezhu's #184 should address the issue; it's looking good so far for us. We've also filed https://tracker.ceph.com/issues/48052 upstream, as we believe this may represent an issue with request forwarding in Nautilus.