cloudprober / cloudprober

An active monitoring software to detect failures before your customers do.

Home Page: http://cloudprober.org

How to run probes with timeout > interval?

cbroglie opened this issue

With cloudprober v0.11.4, we're able to run DNS probes with interval_msec=100 and timeout_msec=2000. Running probes 10x per second provides enough samples to quickly alert on elevated failure rates while avoiding false positives from occasional drops.

More recent versions added the restriction that timeout must be <= interval, enforcing that there is only a single outstanding probe per interval. While 2s is probably excessive for a DNS probe, we would still like to be able to use a timeout greater than 100ms. The reason is that once the timeout is reached, we lose all visibility into how long requests are actually taking. Therefore, we like to set probe timeouts much higher than we would for an application timeout, and use the latency metrics to alert at lower thresholds than the timeouts.
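To make the visibility point concrete, here is a minimal Python sketch. It is not Cloudprober code: the `observe` helper and the sample latencies are made up for illustration, and it simply assumes that any request slower than the timeout is counted as a failure with no latency recorded.

```python
def observe(true_latencies_msec, timeout_msec):
    """Split requests into recorded latencies and timed-out failures."""
    recorded = [l for l in true_latencies_msec if l <= timeout_msec]
    failures = len(true_latencies_msec) - len(recorded)
    return recorded, failures


samples = [20, 35, 150, 400, 90]  # hypothetical true latencies in msec

# With a 100 ms timeout, the two slow requests only show up as failures.
print(observe(samples, timeout_msec=100))   # ([20, 35, 90], 2)

# With a 2000 ms timeout, the 150 ms and 400 ms latencies stay visible.
print(observe(samples, timeout_msec=2000))  # ([20, 35, 150, 400, 90], 0)
```

With the longer timeout, an alert can fire on a latency threshold such as 100 ms while the metrics still show how slow the slow requests actually were.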

Would it be possible to support interval < timeout, by allowing multiple outstanding probes to maintain a fixed request rate?

The requests_per_probe option specific to HTTP probes provides close to the desired functionality. It staggers each batch of requests across the interval, and doesn't move on to the next interval until the batch is complete, so it can't guarantee a fixed request rate unless timeout < interval / requests_per_probe. If all requests_per_probe requests were sent at the start of each interval, then you could set interval = timeout, and tune requests_per_probe to achieve the desired average RPS. In my example, it would be interval_msec=2000, timeout_msec=2000, requests_per_probe=20. Of course, this would be more bursty vs. sending 1 request every 100ms.
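The arithmetic behind that example can be sketched in Python. The helper names below are hypothetical and only restate the two relations from the paragraph above: the batch size needed for a target average rate, and the condition under which staggered requests can still hold a fixed rate.

```python
def requests_per_probe_for_rate(target_rps, interval_msec):
    """Batch size needed to average target_rps over one interval."""
    return round(target_rps * interval_msec / 1000)


def staggered_rate_is_fixed(timeout_msec, interval_msec, requests_per_probe):
    """Condition from the text: timeout < interval / requests_per_probe."""
    return timeout_msec < interval_msec / requests_per_probe


# 10 requests per second with interval_msec=2000 needs a batch of 20.
print(requests_per_probe_for_rate(10, 2000))     # 20

# With timeout_msec=2000 the staggered scheme cannot guarantee the rate.
print(staggered_rate_is_fixed(2000, 2000, 20))   # False

# It could only do so with a timeout below 100 ms.
print(staggered_rate_is_fixed(50, 2000, 20))     # True
```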

Hello Chris,

Happy new year! :)

Yeah, we had to make that change because Cloudprober's behavior with timeout > interval is not deterministic. I guess it works fine when very few probes take longer than "interval" to finish, but if that happens consistently, metrics will become inconsistent.

As you said, we can probably do the same thing for DNS probes as we do for HTTP probes, to allow high frequency probing. In HTTP probes, we use the following formula:

     (requests_per_probe * requests_interval) + timeout < interval

You could probably use the following values for your use case:

requests_per_probe = 50
requests_interval = 80 ms
timeout = 999 ms
interval = 5000 ms

This way you'll be able to run 50 requests every 5 seconds, i.e. an average of 10 per second. You can also tune it a bit: increase the interval between requests and reduce the timeout.

There will still be a gap of timeout in every probe interval though (20% in this case), but the impact of that can be minimized by increasing the interval and the number of requests in that interval, while the timeout remains constant.
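As a quick sanity check of these numbers, here is a small Python sketch. The function names are illustrative only, and the gap is approximated as timeout / interval, which is how the 20% figure above is obtained.

```python
def fits_interval(requests_per_probe, requests_interval_msec,
                  timeout_msec, interval_msec):
    """(requests_per_probe * requests_interval) + timeout < interval"""
    busy_msec = requests_per_probe * requests_interval_msec + timeout_msec
    return busy_msec < interval_msec


def average_rps(requests_per_probe, interval_msec):
    """Average request rate over one probe interval."""
    return requests_per_probe * 1000 / interval_msec


def timeout_gap_fraction(timeout_msec, interval_msec):
    """Share of each interval reserved for the timeout."""
    return timeout_msec / interval_msec


# Values suggested above: 50 * 80 + 999 = 4999 < 5000.
print(fits_interval(50, 80, 999, 5000))               # True
print(average_rps(50, 5000))                          # 10.0
print(round(timeout_gap_fraction(999, 5000), 2))      # 0.2

# Doubling the interval and the batch at the same timeout halves the gap.
print(fits_interval(100, 90, 999, 10000))             # True
print(average_rps(100, 10000))                        # 10.0
print(round(timeout_gap_fraction(999, 10000), 2))     # 0.1
```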

To minimize time gaps further, you can run replicas of the same probe:

    {
        name: "my_dns_probe_00"
        ...
    }
    {
        name: "my_dns_probe_01"
        ...
    }
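A rough Python simulation of why replicas help. It assumes the two replicas happen to start half an interval apart, which nothing in this thread guarantees; the helpers and the offset are purely illustrative. It uses the values suggested above (50 requests, 80 ms apart, 5000 ms interval) and measures the largest gap between consecutive sends.

```python
def send_times(offset_msec, interval_msec, requests_per_probe,
               requests_interval_msec, cycles):
    """Send timestamps (msec) for one probe over a number of cycles."""
    return [
        offset_msec + c * interval_msec + i * requests_interval_msec
        for c in range(cycles)
        for i in range(requests_per_probe)
    ]


def max_gap(times, start_msec, end_msec):
    """Largest gap between consecutive sends inside [start, end]."""
    window = sorted(t for t in times if start_msec <= t <= end_msec)
    return max(b - a for a, b in zip(window, window[1:]))


probe_00 = send_times(0, 5000, 50, 80, cycles=4)
# Assumption: the second replica starts 2500 ms after the first.
probe_01 = send_times(2500, 5000, 50, 80, cycles=4)

# Measure only after both probes are in steady state.
print(max_gap(probe_00, 5000, 15000))             # 1080
print(max_gap(probe_00 + probe_01, 5000, 15000))  # 80
```

If the replicas start at nearly the same time instead, their gaps overlap and little is gained, so the benefit depends on how the probes end up offset from each other.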

What do you think?

Happy new year! 😄

"There will still be a gap of timeout in every probe interval though (20% in this case), but the impact of that can be minimized by increasing the interval and the number of requests in that interval, while the timeout remains constant."

That's a good point about using a larger interval and number of requests to amortize the timeout. I think that should work well enough in practice for achieving a fixed rate of requests.

Cool, I've prepared a PR to add parallel requests to DNS probes: #670

I'll do some more testing and add some tests before merging.

Thanks - #670 LGTM

@cbroglie I've merged #670. Give it a try if you can, using the 'main' docker tag or pre-release binaries. I'm planning to cut a new release in a couple of weeks if you want to wait for that.

Gave docker.io/cloudprober/cloudprober:main a shot and everything seems to work as expected, thanks again!

Awesome. Great to hear.

I'll close this now. I am targeting the new release in about 2-3 weeks, but may end up releasing sooner.