googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes

Home Page:https://agones.dev

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

In-place Agones Upgrades: Storage Compatibility

zmerlynn opened this issue · comments

Note

Milestone of #3766, which we are seeking feedback on. We will move forward with pieces that seem non-contentious, though.

Storage Compatibility

Defaulting

Right now, Agones defaults values in the webhook, e.g. GameServer defaulting is the only real thing the GameServer mutation webhook does. Defaulting serves a couple of purposes:

  • Code can rely on default values, e.g. hook for eviction.safe. It’s complicated to constantly have to verify the object has been properly defaulted.
  • Users see the rendered default value in kubectl describe. This increases discoverability for new API elements, as describe is a common UI element.

The problem is that defaulting in the webhook alone is not safe across configuration changes. Example using eviction.safe:

  • t0: Agones deployed at 1.29 with SafeToEvict disabled.
  • t1: GameServer foo created. eviction.safe is not defaulted because the feature gate was not enabled.
  • t2: Agones upgraded to 1.30, SafeToEvict enabled. eviction.safe is defaulted on new objects.
  • t3: controller fails silently for foo, logging unknown eviction safe value “”.

The reason for the failure is that the hook blindly assumes the value was defaulted by the webhook, but the webhook never had a chance to run. Note that for this particular case, the gap here is very narrow in time - in particular t1 and t2 need to occur such that defaulting of the GameServer occurs on a 1.29 agones-extensions container, but a 1.30 agones-controller container attempts to create the Pod. That said, this condition could easily occur during rollout of 1.30, though, and cause a multi-second hiccup.

To solve this problem, I propose we:

  • Change our defaulting code to be idempotent. Right now there is a comment ApplyDefaults applies default values to the GameServer if they are not already populated which suggests that this should be the case, but is immediately followed by two non-idempotent operations (at least non-idempotent over an upgrade).
  • If defaulting is idempotent, then the event handlers for controller can run the defaulting before the GameServer enters the queue, e.g. we can call ApplyDefaults here (and in similar places we Get in the controller - probably via helper function). This effectively ensures defaulting on creation and on subsequent update, but it relies on idempotent defaulting.

Unknown or Disabled Fields

A similar problem exists for API fields that should not be present but have already been set, which typically only occurs if a feature gate has been disabled (either due to downgrade or explicit disablement). This takes a couple of forms:

  • An old-version controller reads a new-version field that it has no knowledge of. Again using the example of eviction.safe, imagine downgrading from 1.30 (where the feature gate is Beta and therefore the API is defaulted) to 1.28 (prior to eviction.safe even existing).
  • A controller has knowledge of a field but the feature gate is disabled. This path results in validation errors on create, but in the case the user actively disabled a feature gate, we need to cover the latent/existing objects as well.

In either of these cases, we need to ensure that on "first touch", the controller drops the unknown fields, rather than preserving them. In general, this is a safer handling of latent unknown fields - otherwise when the feature gate is reenabled, a preserved field could surprise the user. (Note that the first case, where the field is not present at all in the CRD, is generally covered by field pruning, so mostly this is figuring out logic for the latter case.)

Update vs Patch for controllers

Note that controllers have a similar problem to the SDK of using Update vs Patch, mentioned here - different controller versions may drop fields. However:

  • skew between controller versions is not expected to last long (minutes at most), so the problem is short-lived. This is a little different than the sdk-server, which requires a full rollout.
  • we call out in the constraints that some features may flap until rolled out: #3766 (comment)

Apply Default

I like the approach. But you would want to ApplyDefaults() here:

gs, err := c.gameServerLister.GameServers(namespace).Get(name)
if err != nil {
if k8serrors.IsNotFound(err) {
loggerForGameServerKey(key, c.baseLogger).Debug("GameServer is no longer available for syncing")
return nil
}
return errors.Wrapf(err, "error retrieving GameServer %s from namespace %s", name, namespace)
}

Rather than where you specified, as enqueing only enque's the namespace/name of the object, not the object itself - giving the best opportunity to get the latest Object at the time of syncronisation.

Everything else.

LGTM! I like the approach 👍🏻

Apply Default

I like the approach. But you would want to ApplyDefaults() here:

You're right. It might be nice if we had a helper function here that's basically "Get and Default" (for the case of inline changes like you just did for the migration controller), but agreed on the placement.

A couple of proposed changes on this:

We should create a new policy that any new required CRD fields must be non-nullable, and have a default specified in the CRD itself. This means on install all custom resources will immediately have the new field with the default.

  • This covers the case of a new controller requiring a field to be non-nil.
  • Any new non-required fields may be marked nullable: true in the CRD. The expectation here is that a controller will not panic when a non-required field is nil.
  • Reiterating from the main comment that any fields removed in the CRD on upgrade will immediately be pruned (deleted) from all objects by Kubernetes.

I don’t think we need to update the ApplyDefaults() for Game Servers. As Game Servers are inherently ephemeral, there is no need to make changes to the Game Servers. By design we allow for skew between the controller and SDK version, and any field changes can be managed by the CRD itself.

  • The Game Server state is maintained in memory, and is only written back to the kube-apiserver (.GameServers().Update()) during creation, starting, allocation, unhealthy states, and deletion.
  • Any Ready and Allocated Game Servers pods have the existing configuration (with the exception of any new non-nullable and defaulted CRD fields), and will work with the existing SDK image.
    • A side note here that on a Fleet rollout after an Agones upgrade the Fleet will create a new Game Server Set at the new configuration, and all existing Game Servers will eventually be deleted and recreated on the new Game Server Set per the Fleet’s rollout strategy.
  • As new Game Servers are created they will be at the new configuration. This includes all configuration relevant to rendering the Game Server, such as the version, the feature flags, and the sdk-server configuration.

And a few additional comments:

  • The user can initiate a rollout of the Fleet via a label update to change over all Ready Game Servers to the new configuration.
    • The Fleet spec and Game Server Set configuration should upgrade to the new configuration on user initiated Fleet rollout.
    • On update, Fleet annotation should be updated to the most recent SDK version. This is for logging purposes.
  • The Fleet status and Game Server Set status should reflect the new configuration regardless of whether or not a Fleet upgrade rollout has begun.
    • For example, downgrading from an Agones version with CountsAndLists in Beta (on by default) to an Agones version with CountsAndLists in Alpha (off by default) will turn off the feature gate CountsAndLists. The Fleet will drop any status relating to CountsAndLists as the feature is no longer on. This is a clear indication to the user that the feature flag is turned off.