digitalocean / ceph_exporter

Prometheus exporter that scrapes meta information about a ceph cluster.

No timeout when cluster is down

jan--f opened this issue

When the cluster is unavailable (e.g. because 2 of 3 MONs are down), the ceph_exporter seems to never return. Since the ceph_exporter is not actually dependent on a running cluster, it would be nicer if it could return an appropriate status. Right now a monitoring solution has to rely on a Prometheus scrape timeout to detect this.

This might be annoying. ceph_exporter talks to the cluster via librados, which AFAIK doesn't actually provide any sort of connect timeout functionality itself.

But couldn't the ceph_exporter timeout after some time?

Somehow, I guess :-) I was just thinking that might be annoying to implement in ceph_exporter itself if there's not already some suitable timeout functionality inside librados. But I'm speculating, really, not being familiar enough with the codebase.

OK, in our experience you cannot time out librados. The only option is to kill the process, e.g. by spawning a new thread that calls kill(getpid(), sig) after a minute or so.
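For illustration, a minimal Go sketch of that escape hatch, assuming the exporter runs under a supervisor (systemd, a container runtime, etc.) that restarts it; the function name and the one-minute limit are made up:

```go
package collectors

import (
	"os"
	"syscall"
	"time"
)

// armWatchdog kills the whole process if stop() is not called within d,
// mirroring the "spawn a thread that calls kill(getpid(), sig)" idea above:
// a hung librados call cannot be cancelled, so exiting and letting the
// supervisor restart the exporter is the blunt escape hatch.
func armWatchdog(d time.Duration) (stop func()) {
	t := time.AfterFunc(d, func() {
		syscall.Kill(os.Getpid(), syscall.SIGKILL)
	})
	return func() { t.Stop() }
}
```

A caller would wrap each librados call in `stop := armWatchdog(time.Minute)` / `defer stop()`.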

Meanwhile, has anyone tried playing with rados_{mon,osd}_op_timeout to see if it does their bidding?
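For anyone who wants to experiment: these are ordinary Ceph client options, so they can be set in the ceph.conf the exporter reads or programmatically before connecting. A rough go-ceph sketch, where the function name and the 30-second values are just examples, not a recommendation:

```go
package collectors

import "github.com/ceph/go-ceph/rados"

// newConnWithOpTimeouts builds a cluster handle with the librados operation
// timeouts mentioned above set. Whether these cover every possible hang is
// exactly the open question in this thread.
func newConnWithOpTimeouts() (*rados.Conn, error) {
	conn, err := rados.NewConn()
	if err != nil {
		return nil, err
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		return nil, err
	}
	// Values are in seconds; 0 (the default) means no timeout.
	if err := conn.SetConfigOption("rados_mon_op_timeout", "30"); err != nil {
		return nil, err
	}
	if err := conn.SetConfigOption("rados_osd_op_timeout", "30"); err != nil {
		return nil, err
	}
	if err := conn.Connect(); err != nil {
		return nil, err
	}
	return conn, nil
}
```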

Do we really want to force users to set the general librados timeout to fix an issue with monitoring? In any case, AFAIK this timeout does not work in all cases: if you shut down too many OSDs, RBD monitoring will hang indefinitely.

Looking into the prometheus docs, it seems like this should be handled more gracefully:
https://prometheus.io/docs/instrumenting/writing_exporters/#failed-scrapes
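The pattern the docs describe, translated into a rough client_golang sketch; the ceph_exporter_up metric name and the collector shape here are illustrative, not what the exporter currently does:

```go
package collectors

import "github.com/prometheus/client_golang/prometheus"

// cephCollector exposes an explicit up/down metric so a dead cluster shows
// up as a fast, well-formed scrape instead of a hang.
type cephCollector struct {
	up *prometheus.Desc
}

func newCephCollector() *cephCollector {
	return &cephCollector{
		up: prometheus.NewDesc("ceph_exporter_up",
			"Whether the last attempt to reach the Ceph cluster succeeded.",
			nil, nil),
	}
}

func (c *cephCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.up
}

func (c *cephCollector) Collect(ch chan<- prometheus.Metric) {
	if err := c.scrapeCluster(ch); err != nil {
		// Failed scrape: report up=0 and return promptly.
		ch <- prometheus.MustNewConstMetric(c.up, prometheus.GaugeValue, 0)
		return
	}
	ch <- prometheus.MustNewConstMetric(c.up, prometheus.GaugeValue, 1)
}

// scrapeCluster stands in for the real collection logic.
func (c *cephCollector) scrapeCluster(ch chan<- prometheus.Metric) error {
	// ... gather and emit cluster metrics here ...
	return nil
}
```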

Another reason to address this: after the cluster has been down for a while, the ceph_exporter consumes all the file handles it is allowed to open, which is then reported in the syslog.

I see two ways of resolving this:

  1. Destroy the cluster handle after each run and re-create it on the next scrape. We should get an error there if the cluster is down. The drawback is that this puts more load on the monitors, but it shouldn't be too bad if not run too often (once a minute would be fine imho). Essentially this would behave like running ceph -s every minute.

  2. Put the actual command execution in a child process and monitor that. If it doesn't return within a timeout, emit the appropriate scrape results and tear down the cluster handle (see the sketch below).

I'd favour solution 2 but I suspect this would be more complex to implement. What do you guys think?
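For option 2, a rough sketch of the timeout side, with the collection running in its own goroutine; the names and the caller-supplied limit are illustrative, and note that a hung librados call keeps running in the background until the handle is torn down:

```go
package collectors

import (
	"errors"
	"time"
)

// ErrScrapeTimeout is returned when the cluster does not answer in time.
var ErrScrapeTimeout = errors.New("ceph scrape timed out")

// collectWithTimeout runs collect in its own goroutine so a hung librados
// call cannot block the scrape handler forever. On timeout the caller should
// report a failed scrape and tear down (or recycle) the cluster handle.
func collectWithTimeout(collect func() error, limit time.Duration) error {
	// Buffered so a late result can still be sent without leaking the goroutine.
	done := make(chan error, 1)
	go func() {
		done <- collect()
	}()

	select {
	case err := <-done:
		return err
	case <-time.After(limit):
		return ErrScrapeTimeout
	}
}
```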

@sebastian-philipp It is assumed that ceph_exporter is treated as a client of the system and thus expects to have a separate configuration of its own (ideally along with a separate read-only auth user). A container is provided to make it easier to pick a configuration that doesn't need to overlap with the production one used for MONs and OSDs. If it works, I don't see any issue with using the librados timeout, since there are no data-path ops that can accidentally be sent over it. It should apply even when several OSDs are down, because the timeout is applied per connection to each individual OSD. I would have really liked a better way of injecting timeouts into Ceph calls, but we have to make do with what we have.

@jan--f Agreed, that is indeed bad. Using timeouts to allow reclamation should help solve the resource consumption issue to some degree. The problem with re-creating handles is that Prometheus exporters are not allowed to control scrape intervals: those are decided on the server, and I think it's important we make a best effort to provide values at whatever granularity they might be needed across all use cases. Option 2 sounds better, where data is gathered within a goroutine that runs separately from the main loop. I will take a stab at implementing it, but if you already have something in the works, feel free to make a PR.
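In case it's useful, the reclamation half might look roughly like this: keep one long-lived handle and only tear it down and re-dial after a scrape times out or errors. All names here are illustrative, and whether Shutdown() behaves well while a command on that handle is still stuck is an assumption worth verifying:

```go
package collectors

import (
	"sync"

	"github.com/ceph/go-ceph/rados"
)

// connPool holds a single long-lived cluster handle and replaces it only
// after a bad scrape, so file descriptors held by stuck connections get
// reclaimed without paying the reconnect cost on every scrape.
type connPool struct {
	mu   sync.Mutex
	conn *rados.Conn
}

// get returns the current handle, dialling a new one if necessary.
func (p *connPool) get() (*rados.Conn, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.conn != nil {
		return p.conn, nil
	}
	conn, err := rados.NewConn()
	if err != nil {
		return nil, err
	}
	if err := conn.ReadDefaultConfigFile(); err != nil {
		return nil, err
	}
	if err := conn.Connect(); err != nil {
		return nil, err
	}
	p.conn = conn
	return conn, nil
}

// recycle tears the current handle down; the next get() reconnects.
func (p *connPool) recycle() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.conn != nil {
		p.conn.Shutdown()
		p.conn = nil
	}
}
```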

This also won't be an issue long-term, because Luminous will expose this information via ceph-mgr. That way regardless of the state of the cluster, as long as the mgr is up we should be able to continually view the state of the system.

@neurodrone Please go ahead. I'm fairly ignorant when it comes to Go, so it would take me significant time and effort to come up with something.
I also think fixing this will have serious benefits even over a longer time frame, since many people run older Ceph versions. Pretty sure there are still some clusters from the Hammer (0.9x) release up and running, and upgrading a running Ceph cluster can be daunting.

Fixed by #80