solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy

Home Page: https://docs.solo.io/

Adjust and Document Default Validation API

sam-heilbron opened this issue · comments

Gloo Edge Product

Open Source

Gloo Edge Version

v1.17.0

Is your feature request related to a problem? Please describe.

I want to depend on the configuration validation that is built into the product. At the moment, some of the values that are used to configure validation are not clearly documented, and some of the default values are confusing to understand.

Describe the solution you'd like

AlwaysAcceptResources defaults to false

The alwaysAcceptResources API field allows users to enable validation and, in essence, have it perform a no-op. When this is set to true, an invalid resource will still be processed by the validation API, but the webhook will "always accept" the resource.

This value defaults to true, largely for backwards compatibility reasons from when it was first introduced (this way users could upgrade Gloo, and their behavior wouldn't change). Given how long the feature has been in the product API for, and how widely used validation is, I think it makes sense to change this default to false. In fact, many users may think they are enabling validation, but not realize that although they have enabled it, it is having no impact due to this setting.

There is still a validation.enabled API, which allows you to disable validation altogether. And if you really wanted to revert to the previous behavior, you could set alwaysAccept=true.
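To make the interaction between the two settings concrete, here is a minimal sketch of the admission decision as described above. It is a simplified model and not Gloo's actual webhook code; the function name and the `snapshot_valid` parameter are hypothetical.

```python
def webhook_admits(enabled: bool, always_accept: bool, snapshot_valid: bool) -> bool:
    """Simplified model: return True if the resource is admitted to the cluster."""
    if not enabled:
        # Validation is disabled altogether, so nothing is checked.
        return True
    if snapshot_valid:
        # Validation ran and found no errors.
        return True
    # Validation ran and found errors; the outcome depends on alwaysAccept.
    return always_accept


# Current defaults (enabled, alwaysAccept=true): invalid configuration is admitted.
print(webhook_admits(enabled=True, always_accept=True, snapshot_valid=False))   # True
# Proposed default (alwaysAccept=false): invalid configuration is rejected.
print(webhook_admits(enabled=True, always_accept=False, snapshot_valid=False))  # False
```

In this model, the only combination that ever rejects a resource is validation enabled, alwaysAccept=false, and an invalid snapshot.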

Document all Validation Knobs

Our Helm API supports a number of fields for configuring the Validation API and the Validation Webhook: https://github.com/solo-io/gloo/blob/main/install/helm/gloo/generate/values.go#L388. These should not only be documented inline in the Helm API (as they are), but also captured in a user-facing guide like: https://docs.solo.io/gloo-edge/latest/guides/traffic_management/configuration_validation/admission_control/

This should include a combination of what options we support, as well as production recommendations.

Describe alternatives you've considered

No response

Additional Context

No response

This should definitely be better documented

I initially thought the change of the default value for alwaysAcceptResources seemed reasonable, but having dived deeper I think we probably shouldn’t make a change, as the potential to break users moving forward would have similar or worse consequences.

This value defaults to true, largely for backwards compatibility reasons from when it was first introduced (this way users could upgrade Gloo, and their behavior wouldn't change)

The field dates to the same PR as validation overall, at a point where AFAICT there weren’t default values, so I'm not sure what backwards compatibility issues there could have been.

I think it makes sense to change this default to false. In fact, many users may think they are enabling validation, but not realize that although they have enabled it, it is having no impact due to this setting.

We also have validation enabled by default, so if we only change the default for alwaysAcceptResources, the default will be to have validation on and enforced. That could unexpectedly break users who have invalid resources in their snapshots and may or may not intend to have validation enabled.

Therefore it might make more sense to have both default to false, so that in the worst case users lose the logging of aborted validation decisions if they’re relying on the previous default behavior, and then get full validation if they opt in to validation.

This, however, runs the additional risk of breaking behavior for users who have configured other validation settings (such as setting alwaysAcceptResources to false) without touching enabled, having relied on the default behavior there. In that case validation would quietly turn off, which recreates the scenario we’re trying to avoid, whereby users think the default values will cause validation to be enforced when actually they won’t.
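The trade-off between the two options can be sketched by merging user-supplied values over each candidate set of defaults. This is only an illustration of the reasoning above: the dictionary keys are shorthand for the two settings under discussion, and `is_enforced` is a hypothetical helper, not part of Gloo.

```python
# Candidate default sets (shorthand keys, not the real Helm paths).
CURRENT_DEFAULTS = {"enabled": True, "alwaysAccept": True}
ONLY_ALWAYS_ACCEPT_CHANGED = {"enabled": True, "alwaysAccept": False}
BOTH_CHANGED = {"enabled": False, "alwaysAccept": False}


def is_enforced(user_values: dict, defaults: dict) -> bool:
    """Return True if invalid resources would actually be rejected."""
    effective = {**defaults, **user_values}  # user values override defaults
    return effective["enabled"] and not effective["alwaysAccept"]


untouched = {}                      # user relies entirely on defaults
opted_in = {"alwaysAccept": False}  # user set alwaysAccept but never touched enabled

for name, defaults in [
    ("current", CURRENT_DEFAULTS),
    ("only alwaysAccept changed", ONLY_ALWAYS_ACCEPT_CHANGED),
    ("both changed", BOTH_CHANGED),
]:
    print(name, is_enforced(untouched, defaults), is_enforced(opted_in, defaults))
```

Under these assumptions, changing only alwaysAccept turns enforcement on for users who never touched either value, while changing both turns enforcement off for the user who opted in via alwaysAccept alone.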

If there’s a way forward, it definitely entails loud warnings in the changelog and potentially a formal deprecation of alwaysAcceptResources and introduction of a new field, though that may be overkill.

I agree that this change would be considered a BREAKING_CHANGE to some users; however, I think it should be a small subset of users. Additionally, I think it is a clearer user experience for new users, and in a way would force existing users whom this might impact to review their user-defined values. From a product perspective, I would advocate for the preferred API (validation is on by default, and resources are actually validated by default), even if it comes at the cost of a potentially more challenging upgrade (setting 1 additional Helm value).

Below are the potential cases that I see (though please let me know if I've missed any cases):

1. I'm a new user who installed Gloo Gateway with alwaysAccept=[unset]

Today, validation.enabled defaults to true, and alwaysAccept is true. This means that I do not get any of the benefits of validation, even though it looks like it's enabled. I am able to push bad configuration to my cluster.

With this change to default alwaysAccept to false, I would still get validation for free, but this time I would be unable to push bad resources to my cluster.

Since I am a new user, this is NOT a breaking change to me.

2. I'm an existing user who installed Gloo Gateway with validation.enabled=false

This change in default behavior would NOT affect me, because I am not using validation.

As a developer, I would be curious who has this feature disabled, as it is extremely helpful for preventing issues in the cluster.

Since I am not using validation, this is NOT a breaking change to me.

3. I'm an existing user who installed Gloo Gateway with alwaysAccept=true

This change in default behavior would NOT affect me, because I have set the value explicitly as an override.

As a developer, I would be curious why a user would want this, because there is latency associated with validating resources, and no added safety if we always accept them.

Since I have defined an override explicitly, this is NOT a breaking change to me.

4. I'm an existing user who installed Gloo Gateway with alwaysAccept=false

This change in default behavior would not affect me, because I have set the value explicitly as an override.

If you are using validation, this is the recommended setting (which is why I would advocate that we make it the default).

Since I have defined an override explicitly, this is NOT a breaking change to me.

5. I'm an existing user who installed Gloo Gateway with alwaysAccept=[unset] and all valid resources in my cluster

This change in default behavior would NOT affect me, because there are no bad resources in my cluster.

Today, I can push bad resources to the cluster and they are accepted. However, by changing this default, I will not be affected because the cluster is already in an OK state. The next configuration that is processed will be accepted, as long as it is valid.

Since I have an API Snapshot with no errors, this is NOT a breaking change to me.

6. I'm an existing user who installed Gloo Gateway with alwaysAccept=[unset] and at least 1 invalid resource in my cluster

This change in default behavior WOULD affect me, because I have bad resources in my cluster.

Today, I can continue to push configuration to my cluster, even though some resources are invalid. However, by changing this default, I will no longer be able to push any configuration to my cluster, until I resolve the issues.

I have two choices:

  1. Explicitly set alwaysAccept=true, which was the previous default
  2. Resolve the issues in configuration

Since I have an API Snapshot with at least 1 error, this IS a breaking change to me.
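The cases above can be summarized as a small decision function. This is a sketch of the reasoning in this comment only, not a statement about Gloo's upgrade behavior; the function and parameter names are hypothetical, and `always_accept=None` stands for alwaysAccept=[unset].

```python
from typing import Optional


def is_breaking_change(
    existing_user: bool,
    enabled: bool,
    always_accept: Optional[bool],
    snapshot_has_errors: bool,
) -> bool:
    """Return True if defaulting alwaysAccept to false changes behavior for this user."""
    if not existing_user:
        return False  # new install: there is no prior behavior to break
    if not enabled:
        return False  # validation is disabled, so the default is irrelevant
    if always_accept is not None:
        return False  # explicit override wins over any default
    # alwaysAccept is unset: only a snapshot with errors is affected.
    return snapshot_has_errors


# Existing user, validation on, alwaysAccept unset, valid snapshot.
print(is_breaking_change(True, True, None, snapshot_has_errors=False))
# Existing user, validation on, alwaysAccept unset, at least one invalid resource.
print(is_breaking_change(True, True, None, snapshot_has_errors=True))
```

Only the last combination (existing user, validation enabled, alwaysAccept unset, snapshot with errors) comes out as breaking, which matches the cases listed above.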