open-policy-agent / gatekeeper

🐊 Gatekeeper - Policy Controller for Kubernetes

Home Page: https://open-policy-agent.github.io/gatekeeper/

Constraint status content is non-deterministic

prachirp opened this issue · comments

Describe the solution you'd like
When viewing the status of a constraint with violations, an arbitrary subset of totalViolations is listed in the status, and this subset may change each time the cluster is audited. It would be preferable to sort the violations so that the same subset is listed in the constraint status; that way we can determine whether a given violation was remediated or no longer exists.

Anything else you would like to add:
This seems like a few-line fix to the audit manager's updateConstraintStatus method: sort auditResults before appending them to the constraint status. https://github.com/open-policy-agent/gatekeeper/blob/master/pkg/audit/manager.go#L863
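
A minimal sketch of what that could look like, assuming the audit results expose resource identity fields to key on (the struct and field names below are illustrative stand-ins, not Gatekeeper's actual types):

```go
// Sketch only: order audit results deterministically before they are
// truncated and appended to constraint status. With a stable ordering,
// successive audits report the same leading subset whenever the
// underlying set of violations is unchanged.
package audit

import "sort"

// auditResult is a hypothetical stand-in for the real audit result type.
type auditResult struct {
	Group     string
	Kind      string
	Namespace string
	Name      string
	Message   string
}

func sortAuditResults(results []auditResult) {
	sort.Slice(results, func(i, j int) bool {
		a, b := results[i], results[j]
		if a.Group != b.Group {
			return a.Group < b.Group
		}
		if a.Kind != b.Kind {
			return a.Kind < b.Kind
		}
		if a.Namespace != b.Namespace {
			return a.Namespace < b.Namespace
		}
		return a.Name < b.Name
	})
}
```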

Environment:

  • Gatekeeper version:
  • Kubernetes version: (use kubectl version):

@prachirp you can use the constraint-violations-limit flag to increase the number of violations reported in status, up to a maximum of 500 (the default is 20), to get around this problem. If you find that a constraint may have more than 500 violations, you may be able to use pubsub to export violations instead.

Adding to @JaydipGabani's comment, another option would be to read violations from the audit pod's logs.

WRT the specific proposal... unfortunately I don't think sorting violations solves the issue:

  • Assume we have a constraint with 50 violations and a cap of 20 maximum reported violations
  • Let's also assume the violations are stably sorted (the suggestion)
  • Let's assume some new violation, AAAA, is created. According to the sorting rules, it comes first. This pushes violation ZZZZ off the list of violations reported on the constraint.
  • It is impossible to know whether violation ZZZZ has been removed because it has been fixed or if it has merely been pushed off the list. This is the original problem all over again.

It may be worth sorting violations just to have more stable results, but unfortunately doing so doesn't allow us to make any conclusions about whether a violation has been remediated.

Ultimately, the audit report on the status of the resource is informational, but not authoritative. Due to scaling issues it will likely never be authoritative.

@prachirp to be precise, the limit is not 500 violations per constraint. Gatekeeper can keep reporting violations on a constraint until the etcd size limit for the object is reached. By default, each object is allowed a maximum of 1.5 MiB - more info here. So Gatekeeper can report as many violations on one constraint resource as will fit while the object size stays under the etcd limit.
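
As a purely illustrative back-of-the-envelope estimate (the per-violation size here is an assumption, not a measured figure): if each serialized violation entry occupies roughly 300 bytes of the status, the 1.5 MiB (~1,572,864 bytes) object cap would allow on the order of 1,572,864 / 300 ≈ 5,000 violations, minus whatever the rest of the constraint object already consumes. The real ceiling depends on message length and the size of the constraint spec itself.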

Hey @JaydipGabani @maxsmythe thanks for responding! For some background, the violations from status are sufficient for us, so we won't move to alternate sources.

Secondly, we do see policies with thousands of total violations, so we know we will always miss some violations in constraint status no matter the per-constraint violation limit. That is also acceptable.

The main reason for this proposal is to reduce churn in the violations reported in constraint status and make the results slightly more predictable. I was wrong earlier: a finding disappearing from the sorted list won't indicate that the finding was remediated. But sorting the list will help us understand whether the reported violations have fully churned. Currently, the set of total violations might be the same while the ones reported in constraint status are completely different; with a sorted list that would no longer be the case. Being able to tell this would help us a lot.

Sorting violations would make it easier to reason about changes. For example, suppose constraint A had 40 findings starting with X, Y, Z at time 0 and 40 findings starting with A, X, Y at time 1. We could easily see that the state changed between those times, whereas today we only see 40 total findings both times and can't tell that there was churn. A concrete case is a cluster restart between time 0 and time 1: we can be more confident that the cluster has reached eventual consistency when the first 20 of the 40 findings reported in constraint status exactly match the first 20 of the 40 findings reported before the restart.
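
To make that comparison concrete, here is a small sketch (consumer-side, not Gatekeeper code; the violation identifiers are just illustrative strings) of how one could check whether the reported subset changed between two audits once the list is stably sorted:

```go
package main

import "fmt"

// reportedSubsetUnchanged compares the violation identifiers reported in
// constraint status at two points in time. With a stable sort upstream,
// an identical subset means the leading violations did not churn; any
// difference shows the reported state changed.
func reportedSubsetUnchanged(before, after []string) bool {
	if len(before) != len(after) {
		return false
	}
	for i := range before {
		if before[i] != after[i] {
			return false
		}
	}
	return true
}

func main() {
	t0 := []string{"ns-a/pod-x", "ns-a/pod-y", "ns-b/pod-z"}
	t1 := []string{"ns-a/pod-a", "ns-a/pod-x", "ns-a/pod-y"}

	// A new violation (ns-a/pod-a) pushed ns-b/pod-z off the capped,
	// sorted list, so the subsets differ and we know the state changed.
	fmt.Println(reportedSubsetUnchanged(t0, t1)) // false
}
```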

Hope the background helps! And let me know if there is any reason not to take the proposed solution of sorting auditResults before appending to constraint status.

I think this makes sense; if nothing else it has the potential to make diffs less noisy. As long as it's in a place where the performance impact is limited, SGTM.