aquasecurity / starboard

Moved to https://github.com/aquasecurity/trivy-operator

Home Page: https://aquasecurity.github.io/starboard/

Excessive Secret resource generation issue with Starboard scanning

gurugautm opened this issue

What steps did you take and what happened:

Environment: OpenShift v4.7
Aqua v6.2.x
Aqua Enforcer installed in non-privileged mode
Kube Enforcer with Starboard installed

When we perform a scan using Starboard, it creates a scan job and a secret. When the scan fails, the secret does not get deleted; in the customer's environment it is not deleted even when the scan succeeds.
Because of this, around 80k secrets accumulated in the customer's environment.
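
A minimal way to observe this is to watch the operator namespace while a scan runs (a sketch only; the starboard-operator namespace is an assumption based on a default operator install, so adjust it to your setup):

# Watch scan Jobs and their companion Secrets appear while a scan runs.
# The starboard-operator namespace is an assumption for a default install.
kubectl get jobs,secrets -n starboard-operator --watch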

What did you expect to happen:

Temporary secrets should be auto-deleted whether the scan succeeds or fails.

I'm seeing the same symptoms on EKS: hundreds of secrets created in the starboard namespace.

Environment: EKS (1.20)
Starboard-Operator 0.13.2

It would be very helpful to see some logs streamed by the Starboard Operator's pod and minimal reproduction steps on an upstream K8s cluster. We have limited capacity to support managed platforms with custom configurations. In particular, I'd like to see the root cause of the scan jobs failing, which probably prevents us from cleaning up orphaned Secrets properly. I can only assume it's related to some PSP or admission control that prevents scan jobs from running successfully, but we need more details to advise.

It's also very useful to look at events created in the starboard-system namespace with kubectl get events -n starboard-system. (Sometimes pods do not have enough information under ContainerStatuses, but we can often figure out from events why certain pods failed.)
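
For reference, something along these lines should collect the relevant diagnostics (a rough sketch; the deployment and namespace names are assumptions based on a default Helm install, so adjust them to your setup):

# Stream the operator logs (deployment and namespace names are assumptions
# for a default Helm install).
kubectl logs -n starboard-operator deployment/starboard-operator --follow

# List recent events in the namespace where the scan jobs run, newest last.
kubectl get events -n starboard-system --sort-by=.lastTimestamp

# Describe a failed scan pod to see container statuses and related events.
kubectl describe pod -n starboard-system <scan-pod-name>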

Today this killed the Secrets API in one of our clusters ...

kubectl get secrets -n starboard-operator | grep -c Opaque
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get secrets)
7500
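
For anyone who needs to clear the backlog in the meantime, something along these lines should work (a sketch only; it assumes every Opaque secret in the starboard-operator namespace was created by Starboard scan jobs, so double-check before deleting anything):

# Count the leftover Opaque secrets without printing them all.
kubectl get secrets -n starboard-operator --field-selector type=Opaque -o name | wc -l

# Delete them in one pass. WARNING: this removes every Opaque secret in the
# namespace, so make sure nothing else stores Opaque secrets there first.
kubectl get secrets -n starboard-operator --field-selector type=Opaque -o name \
  | xargs -r kubectl delete -n starboard-operator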

@danielpacak
In our case the root cause was hitting the GitHub API rate limit while updating ...

When deleting the starboard namespace I found that far over 20k secrets had been created in 9 days.

Edit: So apparently I was not able to really disable Polaris. I just removed our ImageReference, which somehow delayed the error messages. Not sure how that works, but it explains why I initially didn't see secrets being created.

In our case the issue was the "polaris" plugin, which kept failing. I let it run for a day, during which it produced 203 error logs in the starboard-operator and left 1569 secrets behind. I'm not sure how these numbers correlate; maybe there are 7-8 retries on average? Removing the plugin stopped the errors and stopped the secrets from piling up.

The secrets that are left behind contain the keys worker.password and worker.username.
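
A quick way to confirm which secrets are these orphaned scan-job credentials is something like this (a sketch; it assumes jq is available and uses the starboard-operator namespace from above):

# Print the names of secrets whose data contains a worker.password key,
# i.e. the leftover scan-job credential secrets described above.
kubectl get secrets -n starboard-operator -o json \
  | jq -r '.items[] | select(.data["worker.password"] != null) | .metadata.name'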

I'm unable to track the issue down further because the scan jobs die immediately and don't leave logs. Here is the log entry from the starboard-operator (reformatted for better readability):

{
  "level": "error",
  "ts": 1646135225.9482412,
  "logger": "reconciler.configauditreport",
  "msg": "Scan job container",
  "job": "starboard-operator/scan-configauditreport-797f6d9d6d",
  "container": "polaris",
  "status.reason": "Error",
  "status.message": "",
  "stacktrace": "github.com/aquasecurity/starboard/pkg/operator/controller.(*ConfigAuditReportReconciler).reconcileJobs.func1
	/home/runner/work/starboard/starboard/pkg/operator/controller/configauditreport.go:363
sigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/reconcile/reconcile.go:102
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227"
}

Versions we have been using:
Environment: AWS EKS 1.21
Starboard-Operator: aquasec/starboard-operator:0.14.1
Starboard-Operator Helm Chart: 0.9.1
Trivy: aquasec/trivy:0.24.0
Trivy Helm Chart: 0.4.11
Polaris: fairwinds/polaris:5.0

I found the parameter to disable Polaris and let it run for over an hour. So far there are no error logs regarding Polaris and no secrets getting stuck. Alternatively, I tried switching to Conftest instead of Polaris but received different errors and abandoned the idea.

Here is the parameter to disable the configAuditScanner, and therefore Polaris as well:

operator:
  configAuditScannerEnabled: false
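
For a Helm-managed install, the equivalent change can presumably be applied like this (a sketch; the release name, chart reference, and namespace are assumptions):

# Disable the config audit scanner (and with it the Polaris scan jobs) on an
# existing release. Release name, chart, and namespace are assumptions.
helm upgrade starboard-operator aqua/starboard-operator \
  -n starboard-operator --reuse-values \
  --set operator.configAuditScannerEnabled=false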

Thank you for the feedback @MPritsch. We are actually working on a so-called built-in configuration audit scanner that is going to replace the Polaris and Conftest plugins in the upcoming release. It won't create Kubernetes Job objects or Secrets, and it will be much faster. See #971 for more details.

Same issue, on a cluster with smart jobs (with a private registry).
For example: a Job is created, the scan begins, then the Job is terminated and deleted before the scan ends. The Secret remains.

We now have a working version with Polaris. The underlying issue was missing IAM permissions. We also needed to use Polaris 4.2 instead of 5.0.

On every startup of the starboard-operator we received a "401 Unauthorized: Not Authorized" error for AWS images from ECR. E.g.:

{
  "level": "error",
  "ts": 1646123793.6945415,
  "logger": "reconciler.vulnerabilityreport",
  "msg": "Scan job container",
  "job": "starboard-operator/scan-vulnerabilityreport-6cd9546b84",
  "container": "fluent-bit",
  "status.reason": "Error",
  "status.message": "2022-03-01T08:36:33.091Z\t\u001b[31mFATAL\u001b[0m\tscanner initialize error: unable to initialize the docker scanner: 3 errors occurred:
	* unable to inspect the image (906394416424.dkr.ecr.eu-central-1.amazonaws.com/aws-for-fluent-bit:2.21.5): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
	* unable to initialize Podman client: no podman socket found: stat podman/podman.sock: no such file or directory
	* GET https://906394416424.dkr.ecr.eu-central-1.amazonaws.com/v2/aws-for-fluent-bit/manifests/2.21.5: unexpected status code 401 Unauthorized: Not Authorized",
  "stacktrace": "github.com/aquasecurity/starboard/pkg/operator/controller.(*VulnerabilityReportReconciler).reconcileJobs.func1
	/home/runner/work/starboard/starboard/pkg/operator/controller/vulnerabilityreport.go:320
sigs.k8s.io/controller-runtime/pkg/reconcile.Func.Reconcile
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/reconcile/reconcile.go:102
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227"
}

These were the images that produced the error. The account IDs belong to AWS, not to us:

602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon/aws-load-balancer-controller:v2.3.1
602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.10.1
602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/kube-proxy:v1.21.2-eksbuild.2
602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/coredns:v1.8.4-eksbuild.1
906394416424.dkr.ecr.eu-central-1.amazonaws.com/aws-for-fluent-bit:2.21.5

The solution for these images was giving the following permissions to Starboard (as described in the 'Important' block here: https://docs.aws.amazon.com/AmazonECR/latest/userguide/repository-policy-examples.html):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:GetRepositoryPolicy",
                "ecr:DescribeRepositories",
                "ecr:ListImages",
                "ecr:BatchGetImage"
            ],
            "Resource": [
                "arn:aws:ecr:*:602401143452:repository/*",
                "arn:aws:ecr:*:906394416424:repository/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
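
In case it helps others, we attach a role carrying that policy via IRSA, roughly like this (a sketch; the role ARN placeholder, service account name, and namespace are assumptions for a default install):

# Annotate the service account used by the operator and scan jobs with an IAM
# role that carries the ECR policy above (IRSA). The role ARN, service account
# name, and namespace are placeholders/assumptions.
kubectl annotate serviceaccount starboard-operator \
  -n starboard-operator \
  eks.amazonaws.com/role-arn=arn:aws:iam::<ACCOUNT_ID>:role/<STARBOARD_ECR_ROLE> \
  --overwrite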

This probably would have been much easier to debug with proper error messages from polaris...

We've marked this issue as won't-fix because we merged #971, which performs configuration audits without creating Kubernetes Jobs and Secrets. We call it the built-in configuration audit scanner, and it will be enabled by default in the upcoming v0.15.0 release. Polaris and Conftest will be deprecated at some point.

We'll keep this issue open until v0.15.0 is released.

Just a quick update: while we were able to fix the secret creation and errors on one cluster, another one keeps creating secrets. We're not sure if this is a permission problem again, although we don't see any errors pointing that way. We will disable Polaris completely and wait for your v0.15.0 release to replace it.