aquasecurity / starboard

Moved to https://github.com/aquasecurity/trivy-operator

Home Page: https://aquasecurity.github.io/starboard/

Operator CrashLoop in EKS 1.21

DaemonDude23 opened this issue

What steps did you take and what happened:

Deployed Helm Chart version 0.8.1 with default values, with starboard operator version 0.13.0 and 0.13.1, among a few previous versions.
On any AWS EKS cluster I try to run the operator on that is version 1.21, the operator CrashLoops and throws this error:

{"level":"info","ts":1639588607.0615091,"logger":"main","msg":"Starting operator","buildInfo":{"Version":"0.13.1","Commit":"e9cd6e1467f942ce114468f4d30012bd4256fa9c","Date":"2021-12-01T14:31:52Z","Executabl
e":""}}
{"level":"info","ts":1639588607.0643575,"logger":"operator","msg":"Resolved install mode","install mode":"OwnNamespace","operator namespace":"starboard","target namespaces":}
{"level":"info","ts":1639588607.0653288,"logger":"operator","msg":"Constructing client cache","namespace":"starboard"}
{"level":"error","ts":1639588607.0655043,"logger":"main","msg":"Unable to run starboard operator","error":"getting kube client config: invalid configuration: no configuration has been provided, try setting
KUBERNETES_MASTER environment variable"}

What did you expect to happen:

The Operator to start, become healthy, and begin scans.

Anything else you would like to add:

  • I have run this operator on previous EKS versions like 1.18, but have never been able to get it to run on 1.21. The Kubernetes version is the only variable I could find that determines whether or not the operator will start. I could be misremembering, but I think this same error is thrown on my 1.21 bare-metal homelab as well.

Environment:

  • Helm Chart version: 0.8.1 - latest
  • Starboard version (use starboard version): 0.13.0 and 0.13.1
  • Kubernetes version (use kubectl version): 1.21
  • OS (macOS 10.15, Windows 10, Ubuntu 19.10 etc): Ubuntu 20.04

As another data point: we've got the Starboard Operator working successfully on EKS 1.21, using default settings, but with slightly later versions of Starboard and the Helm chart.

This is a known working combination:

  • EKS: Kubernetes version 1.21, platform version eks.4
  • Managed Node Groups: AMI release version 1.21.5-20220309
  • Helm chart: 0.10.4
  • Starboard Operator: 0.15.4

@DaemonDude23 have you tried it with more recent versions?

I've tested 5-10 releases since filing this issue, all with the same result, including again today after seeing your message. I updated to the latest chart and Starboard versions, but the same error persists. I diffed the latest default values side by side against mine; the only differences are in resources and podAnnotations, so nothing that should cause this kind of error.
I've hit the same error on a bare-metal vanilla Kubernetes homelab as well as on a k3s cluster. Surely others would have run into this problem, but it seems not, and I'm the only outlier.

  • EKS: Kubernetes version 1.21, platform version eks.4
  • Self-Managed Node Groups: Bottlerocket AMI ami-02f29c095430282d4 (I recently switched to Bottlerocket, same error as when I was using the standard EKS AMIs)
  • Helm chart: 0.10.4
  • Starboard Operator: 0.15.4

Digging a bit further based on your error message:

getting kube client config: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable

As part of pkg/operator/operator.go:

kubeConfig, err := ctrl.GetConfig()
if err != nil {
	return fmt.Errorf("getting kube client config: %w", err)
}

So it uses controller-runtime to find a kubeconfig it can use to connect to the Kubernetes API. Usually, when running inside the cluster, you don't have to set anything; it will use the in-cluster config and the service account token.
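
To make the failure mode concrete, here is a minimal standalone sketch (hypothetical, not Starboard code) that calls the same controller-runtime helper. Run with no --kubeconfig flag, no KUBECONFIG variable, no mounted service account token, and no $HOME/.kube/config, it should fail with the same "no configuration has been provided" error:

// repro.go - hypothetical standalone repro, not part of Starboard.
package main

import (
	"fmt"
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Resolves a *rest.Config using the precedence documented below:
	// --kubeconfig flag, KUBECONFIG env var, in-cluster config, ~/.kube/config.
	cfg, err := ctrl.GetConfig()
	if err != nil {
		fmt.Fprintf(os.Stderr, "getting kube client config: %v\n", err)
		os.Exit(1)
	}
	fmt.Println("resolved API server:", cfg.Host)
}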

You can see the order of precedence in controller-runtime/config.go:

// GetConfig creates a *rest.Config for talking to a Kubernetes API server.
// If --kubeconfig is set, will use the kubeconfig file at that location.  Otherwise will assume running
// in cluster and use the cluster provided kubeconfig.
//
// It also applies saner defaults for QPS and burst based on the Kubernetes
// controller manager defaults (20 QPS, 30 burst)
//
// Config precedence
//
// * --kubeconfig flag pointing at a file
//
// * KUBECONFIG environment variable pointing at a file
//
// * In-cluster config if running in cluster
//
// * $HOME/.kube/config if exists.

So a couple of possible thoughts:

  • Have you set any of these other methods of defining a kubeconfig that might be conflicting with the cluster-provided one?
  • Have you locked down anything about your cluster that prevents pods from accessing the K8s API? Disabled mounting of service account tokens? Strict network policies? (A quick in-pod check is sketched below.)
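
On the second point, here is a quick way to verify from inside the operator pod that the in-cluster config inputs actually exist. This is a hypothetical diagnostic sketch, not anything shipped with Starboard:

// checktoken.go - hypothetical diagnostic, not part of Starboard.
// If either file is missing, client-go cannot build the in-cluster config and
// controller-runtime falls through to the "no configuration has been provided" error.
package main

import (
	"fmt"
	"os"
)

func main() {
	for _, f := range []string{
		"/var/run/secrets/kubernetes.io/serviceaccount/token",
		"/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
	} {
		if _, err := os.Stat(f); err != nil {
			fmt.Println("MISSING:", f)
		} else {
			fmt.Println("present:", f)
		}
	}
	// The kubelet injects these; they are also required for the in-cluster config.
	fmt.Println("KUBERNETES_SERVICE_HOST =", os.Getenv("KUBERNETES_SERVICE_HOST"))
	fmt.Println("KUBERNETES_SERVICE_PORT =", os.Getenv("KUBERNETES_SERVICE_PORT"))
}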

Thanks for all the info.
I found my problem (100% user error). In a manifest, I was using kustomize to patch automountServiceAccountToken: false onto the deployment. I thought that would only disable auto-mounting of the default service account token (which wouldn't apply here anyway, since we're not using the default service account), not the token of the service account explicitly assigned to the pod(s).
I'll do some testing with it tomorrow and will likely close this issue then.
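
For anyone who lands here with the same error, the offending patch looked roughly like this (a reconstruction; the deployment name is an assumption). The key detail is that automountServiceAccountToken on the pod spec takes precedence over the same field on the ServiceAccount, so it disables the token mount for whatever service account the pod uses, not just the default one:

# patch-automount.yaml - reconstructed example, names are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: starboard-operator   # assumed deployment/release name
spec:
  template:
    spec:
      # Applies to the pod's assigned service account, not only "default";
      # the pod-level field overrides the ServiceAccount-level setting.
      automountServiceAccountToken: false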

Thanks for jogging my brain to look at service account token mounting! I completely forgot I was using a patch.