jacksontj / promxy

An aggregating proxy to enable HA prometheus

Add config option to optionally not hit all nodes in a server_group

jacksontj opened this issue · comments

Separately from the scatter-gather, promxy hits each node in a given server group and merges their results together. If the first result has no "holes" (as defined by the anti-affinity config), promxy doesn't look at the second response. Right now the query is sent to both for consistent load/performance, but if there were more hosts in the server_group (more than 2, say 10) then hitting all nodes would be excessive. It seems prudent to add some configs:

  1. parallel server fetch count -- how many servers to send the initial request to
  2. max server fetch count -- how many servers we'll continue sending requests to while holes remain

With these a user could control (1) how many servers to hit and (2) if promxy should hold the request trying to get the data from more servers.
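
To make the two proposed knobs concrete, here is a minimal sketch of how they might interact. It is not promxy code: `fetchFunc`, `fetchGroup`, `merge`, `hasHoles`, and the `parallel`/`maxFetch`/`maxGap` parameters are all hypothetical names, and a "hole" is simplified to a gap between consecutive points larger than `maxGap` rather than promxy's anti-affinity logic.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Point is one sample of a single series.
type Point struct {
	T int64   // timestamp in seconds
	V float64 // sample value
}

// fetchFunc is a hypothetical stand-in for querying one node of a server group.
type fetchFunc func() []Point

// merge unions two series by timestamp; on a conflict the point from a wins.
func merge(a, b []Point) []Point {
	byTime := make(map[int64]float64, len(a)+len(b))
	for _, p := range b {
		byTime[p.T] = p.V
	}
	for _, p := range a {
		byTime[p.T] = p.V
	}
	out := make([]Point, 0, len(byTime))
	for t, v := range byTime {
		out = append(out, Point{T: t, V: v})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].T < out[j].T })
	return out
}

// hasHoles reports whether two consecutive points are more than maxGap apart.
// It cannot see gaps at the edges of the range or a series that is absent.
func hasHoles(s []Point, maxGap int64) bool {
	for i := 1; i < len(s); i++ {
		if s[i].T-s[i-1].T > maxGap {
			return true
		}
	}
	return false
}

// fetchGroup queries the first `parallel` nodes concurrently, then keeps
// querying one more node at a time while holes remain, up to maxFetch nodes.
// It returns the merged series and the number of nodes actually queried.
func fetchGroup(nodes []fetchFunc, parallel, maxFetch int, maxGap int64) ([]Point, int) {
	if parallel < 1 {
		parallel = 1
	}
	if maxFetch > len(nodes) {
		maxFetch = len(nodes)
	}
	if parallel > maxFetch {
		parallel = maxFetch
	}
	results := make([][]Point, parallel)
	var wg sync.WaitGroup
	for i := 0; i < parallel; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = nodes[i]()
		}(i)
	}
	wg.Wait()

	var merged []Point
	for _, r := range results {
		merged = merge(merged, r)
	}
	queried := parallel
	for queried < maxFetch && hasHoles(merged, maxGap) {
		merged = merge(merged, nodes[queried]())
		queried++
	}
	return merged, queried
}

func main() {
	full := func() []Point {
		return []Point{{0, 1}, {10, 1}, {20, 1}, {30, 1}, {40, 1}}
	}
	gappy := func() []Point {
		return []Point{{0, 1}, {10, 1}, {40, 1}}
	}
	nodes := []fetchFunc{gappy, full, full}
	merged, queried := fetchGroup(nodes, 1, 3, 10)
	fmt.Printf("queried %d of %d nodes, got %d points\n", queried, len(nodes), len(merged))
}
```

In this sketch the third node is never contacted, because the second response fills the gap left by the first. Note that the check only catches holes it can see, which is a limitation of any scheme along these lines.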

@jacksontj it's not clear from the documentation, but is the option available?

commented

This feature is important for improving query concurrency. A large monitoring system receives a large number of queries from the alerting system and other clients, and we would like to store duplicate data as replicas to improve read performance, so I think HA should not be the only purpose of the hosts in a server_group.
We can set initial_request_count > 2 to guarantee HA for queries, and add more duplicates to improve query performance.

So as of today there are a number of features (e.g. #560) which reduce the servergroups that a given query needs to hit. This issue is specifically on reducing the number of requests within a servergroup.

The main complexity here is that promxy has no idea what "correct" looks like. The merge logic today will basically hit all nodes within a servergroup and merge data (merging the series as well as merging the points within a series if there are holes).
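
As an illustration of the two kinds of merging described above (series across nodes, and points within a series), here is a simplified sketch. It is not promxy's actual merge code: `Sample`, `Response`, `mergeSamples`, and `mergeResponses` are hypothetical names, and labelsets are reduced to plain strings.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is one data point of a series.
type Sample struct {
	T int64   // timestamp in seconds
	V float64 // sample value
}

// Response is a hypothetical, simplified node result:
// labelset (rendered as a string) -> samples sorted by timestamp.
type Response map[string][]Sample

// mergeSamples unions two sample lists by timestamp; a wins on conflicts.
func mergeSamples(a, b []Sample) []Sample {
	byTime := make(map[int64]float64, len(a)+len(b))
	for _, s := range b {
		byTime[s.T] = s.V
	}
	for _, s := range a {
		byTime[s.T] = s.V
	}
	out := make([]Sample, 0, len(byTime))
	for t, v := range byTime {
		out = append(out, Sample{T: t, V: v})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].T < out[j].T })
	return out
}

// mergeResponses unions the series of every response and, where a series
// appears in more than one response, merges its points to fill holes.
func mergeResponses(rs ...Response) Response {
	out := Response{}
	for _, r := range rs {
		for labels, samples := range r {
			out[labels] = mergeSamples(out[labels], samples)
		}
	}
	return out
}

func main() {
	nodeA := Response{
		`up{job="a"}`: {{0, 1}, {20, 1}}, // hole at t=10
	}
	nodeB := Response{
		`up{job="a"}`: {{0, 1}, {10, 1}, {20, 1}},
		`up{job="b"}`: {{0, 1}, {10, 1}, {20, 1}}, // series nodeA lacks
	}
	merged := mergeResponses(nodeA, nodeB)
	for _, labels := range []string{`up{job="a"}`, `up{job="b"}`} {
		fmt.Printf("%s: %d points\n", labels, len(merged[labels]))
	}
}
```

Both series end up with three points here only because every node's response was included in the merge.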

So to highlight the level of issues, here are a couple of scenarios that we'd need to cover:

  • a server has a hole for a given labelset
  • a server is missing a labelset and data

As there is no knowledge of what a complete dataset looks like -- it seems impossible to guarantee that the result is complete/correct without hitting all of the nodes within the group (as "a server" could be the last one we query -- as such missing it would make for an incomplete dataset). Given that promxy is a monitoring/alerting tool it definitely leans on correctness over all else.
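
The missing-labelset scenario can be sketched with a toy example. This assumes a hypothetical early-exit rule of "stop once the responses so far contain no holes"; `Response`, `looksComplete`, and `missingFrom` are illustrative names, not promxy APIs.

```go
package main

import (
	"fmt"
	"sort"
)

// Response maps a labelset (as a string) to the sorted timestamps of its samples.
type Response map[string][]int64

// looksComplete reports whether every series in r is free of gaps larger
// than step. It can only inspect series that r actually contains.
func looksComplete(r Response, step int64) bool {
	for _, ts := range r {
		for i := 1; i < len(ts); i++ {
			if ts[i]-ts[i-1] > step {
				return false
			}
		}
	}
	return true
}

// missingFrom lists the labelsets present in other but absent from r.
func missingFrom(r, other Response) []string {
	var missing []string
	for labels := range other {
		if _, ok := r[labels]; !ok {
			missing = append(missing, labels)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	first := Response{`up{job="a"}`: {0, 10, 20}}
	second := Response{
		`up{job="a"}`: {0, 10, 20},
		`up{job="b"}`: {0, 10, 20},
	}
	fmt.Println("first response looks complete:", looksComplete(first, 10))
	fmt.Println("series only the second node has:", missingFrom(first, second))
}
```

The first response passes the hole check, yet an entire series exists only on the second node, and nothing in the first response hints at that.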

So, with all of that context I think I'm going to close out this issue as I don't see a way forward with this that doesn't fundamentally compromise the correctness of the subsequent data.

If anyone has other ideas/suggestions feel free to chime in, but for now I'll consider this "won't do".