digitalocean / ceph_exporter

Prometheus exporter that scrapes meta information about a ceph cluster.

No timeout when cluster is down

jan--f opened this issue

When the cluster is unavailable (e.g. because 2 of 3 MONs are down), the ceph_exporter seems to never return. Since the ceph_exporter is not actually dependent on a running cluster, it would be nicer if it could return an appropriate status. Right now a monitoring solution has to rely on a Prometheus scrape timeout to detect this.

This might be annoying. ceph_exporter talks to the cluster via librados, which AFAIK doesn't actually provide any sort of connect timeout functionality itself.

But couldn't the ceph_exporter timeout after some time?

Somehow, I guess :-) I was just thinking that might be annoying to implement in ceph_exporter itself if there's not already some suitable timeout functionality inside librados. But I'm speculating, really, not being familiar enough with the codebase.

OK, in our experience you cannot time out librados. The only option is to kill the process, e.g. by spawning a new thread that calls kill(getpid(), sig) after a minute or so.
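For illustration, a minimal Go sketch of that escape hatch, assuming the exporter runs under a supervisor (systemd, a container runtime, etc.) that restarts it; the function name and the one-minute limit are made up:

```go
package collectors

import (
	"os"
	"syscall"
	"time"
)

// armWatchdog kills the whole process if stop() is not called within d,
// mirroring the "spawn a thread that calls kill(getpid(), sig)" idea above:
// a hung librados call cannot be cancelled, so exiting and letting the
// supervisor restart the exporter is the blunt escape hatch.
func armWatchdog(d time.Duration) (stop func()) {
	t := time.AfterFunc(d, func() {
		syscall.Kill(os.Getpid(), syscall.SIGKILL)
	})
	return func() { t.Stop() }
}
```

A caller would wrap each librados call in `stop := armWatchdog(time.Minute)` / `defer stop()`.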

Meanwhile, has anyone tried playing with rados_{mon,osd}_op_timeout to see if it does their bidding?
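For anyone who wants to experiment: these are ordinary Ceph client options, so they can be set in the ceph.conf the exporter reads or programmatically before connecting. A rough go-ceph sketch, where the function name and the 30-second values are just examples, not a recommendation:

```go
package collectors

import "github.com/ceph/go-ceph/rados"

// newConnWithOpTimeouts builds a cluster handle with the librados operation
// timeouts mentioned above set. Whether these cover every possible hang is
// exactly the open question in this thread.
func newConnWithOpTimeouts() (*rados.Conn, error) {
	conn, err := rados.NewConn()
	if err != nil {
		return nil, err
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		return nil, err
	}
	// Values are in seconds; 0 (the default) means no timeout.
	if err := conn.SetConfigOption("rados_mon_op_timeout", "30"); err != nil {
		return nil, err
	}
	if err := conn.SetConfigOption("rados_osd_op_timeout", "30"); err != nil {
		return nil, err
	}
	if err := conn.Connect(); err != nil {
		return nil, err
	}
	return conn, nil
}
```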

Do we really want to force users to set the general librados timeout to fix an issue with monitoring? In any case, AFAIK this timeout does not work in all cases: if you shut down too many OSDs, RBD monitoring will hang indefinitely.

Looking into the prometheus docs, it seems like this should be handled more gracefully:
https://prometheus.io/docs/instrumenting/writing_exporters/#failed-scrapes
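The pattern the docs describe, translated into a rough client_golang sketch; the ceph_exporter_up metric name and the collector shape here are illustrative, not what the exporter currently does:

```go
package collectors

import "github.com/prometheus/client_golang/prometheus"

// cephCollector exposes an explicit up/down metric so a dead cluster shows
// up as a fast, well-formed scrape instead of a hang.
type cephCollector struct {
	up *prometheus.Desc
}

func newCephCollector() *cephCollector {
	return &cephCollector{
		up: prometheus.NewDesc("ceph_exporter_up",
			"Whether the last attempt to reach the Ceph cluster succeeded.",
			nil, nil),
	}
}

func (c *cephCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.up
}

func (c *cephCollector) Collect(ch chan<- prometheus.Metric) {
	if err := c.scrapeCluster(ch); err != nil {
		// Failed scrape: report up=0 and return promptly.
		ch <- prometheus.MustNewConstMetric(c.up, prometheus.GaugeValue, 0)
		return
	}
	ch <- prometheus.MustNewConstMetric(c.up, prometheus.GaugeValue, 1)
}

// scrapeCluster stands in for the real collection logic.
func (c *cephCollector) scrapeCluster(ch chan<- prometheus.Metric) error {
	// ... gather and emit cluster metrics here ...
	return nil
}
```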

Another reason to address this: after the cluster has been down for a while, the ceph_exporter consumes all the file handles it is allowed to open, which is then reported in the syslog.

I see two ways of resolving this:

  1. Destroy the cluster handle after each run and re-create it on the next scrape. We should get an error there if the cluster is down. The drawback is that this puts more load on the monitors, but it shouldn't be too bad if not run too often (once a minute would be fine imho). Essentially this would behave like running ceph -s every minute.

  2. Put the actual command execution in a child process and monitor that. If it doesn't return within a timeout, emit the appropriate scrape results and tear down the cluster handle (see the sketch below).

I'd favour solution 2 but I suspect this would be more complex to implement. What do you guys think?
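For option 2, a rough sketch of the timeout side, with the collection running in its own goroutine; the names and the caller-supplied limit are illustrative, and note that a hung librados call keeps running in the background until the handle is torn down:

```go
package collectors

import (
	"errors"
	"time"
)

// ErrScrapeTimeout is returned when the cluster does not answer in time.
var ErrScrapeTimeout = errors.New("ceph scrape timed out")

// collectWithTimeout runs collect in its own goroutine so a hung librados
// call cannot block the scrape handler forever. On timeout the caller should
// report a failed scrape and tear down (or recycle) the cluster handle.
func collectWithTimeout(collect func() error, limit time.Duration) error {
	// Buffered so a late result can still be sent without leaking the goroutine.
	done := make(chan error, 1)
	go func() {
		done <- collect()
	}()

	select {
	case err := <-done:
		return err
	case <-time.After(limit):
		return ErrScrapeTimeout
	}
}
```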

@sebastian-philipp It is assumed that ceph_exporter is treated as a client of the system and thus expects to have a separate configuration of its own (ideally along with a separate read-only auth user). A container is provided to make it easier to pick a configuration that doesn't need to overlap with the production one used for MONs and OSDs. If it works, I don't see any issue with using the librados timeout, since there are no data-path ops that can accidentally be sent over it. It should apply even when several OSDs are down, because the timeout is applied per connection to each individual OSD. I would have really liked a better way of injecting timeouts into Ceph calls, but we have to make do with what we have.

@jan--f Agreed, that is indeed bad. Using timeouts to allow reclamation should help solve the resource consumption issue to some degree. The problem with re-creating handles is that Prometheus exporters are not allowed to control scrape intervals: those are decided on the server, and I think it's important we make a best effort to provide values at whatever granularity they might be needed across all use cases. Option 2 sounds better, where data is gathered within a goroutine that runs separately from the main loop. I will take a stab at implementing it, but if you already have something in the works, feel free to make a PR.
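In case it's useful, the reclamation half might look roughly like this: keep one long-lived handle and only tear it down and re-dial after a scrape times out or errors. All names here are illustrative, and whether Shutdown() behaves well while a command on that handle is still stuck is an assumption worth verifying:

```go
package collectors

import (
	"sync"

	"github.com/ceph/go-ceph/rados"
)

// connPool holds a single long-lived cluster handle and replaces it only
// after a bad scrape, so file descriptors held by stuck connections get
// reclaimed without paying the reconnect cost on every scrape.
type connPool struct {
	mu   sync.Mutex
	conn *rados.Conn
}

// get returns the current handle, dialling a new one if necessary.
func (p *connPool) get() (*rados.Conn, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.conn != nil {
		return p.conn, nil
	}
	conn, err := rados.NewConn()
	if err != nil {
		return nil, err
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		return nil, err
	}
	if err := conn.Connect(); err != nil {
		return nil, err
	}
	p.conn = conn
	return conn, nil
}

// recycle tears the current handle down; the next get() reconnects.
func (p *connPool) recycle() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.conn != nil {
		p.conn.Shutdown()
		p.conn = nil
	}
}
```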

This also won't be an issue long-term, because Luminous will expose this information via ceph-mgr. That way regardless of the state of the cluster, as long as the mgr is up we should be able to continually view the state of the system.

@neurodrone Please go ahead. I'm fairly ignorant when it comes to Go, so it would take me significant time and effort to come up with something.
I also think fixing this will have serious benefits even over a longer time frame, since many people run older Ceph versions. Pretty sure there are still some clusters from the Hammer (0.9x) release up and running, and upgrading a running Ceph cluster can be daunting.

Fixed by #80