Add nodeSelector for Scan Jobs
dschunack opened this issue · comments
What steps did you take and what happened:
At the moment it is not possible to add a nodeSelector so that the scan jobs run only on Linux workers.
We also have Windows workers, and without a nodeSelector like this:

nodeSelector:
  kubernetes.io/os: linux

the pods are placed on a Windows worker node instead of a Linux worker. This fails, and the pods hang in the ContainerCreating state for hours. It's not currently possible to run Linux containers on Windows nodes, or vice versa.
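For context, that selector matches the well-known OS label which the kubelet sets on every node; a sketch of a Linux worker's metadata (the node name and arch value are illustrative):

# Sketch: kubernetes.io/os is a well-known label set by the kubelet;
# it is "windows" on Windows workers. Name and arch are placeholders.
apiVersion: v1
kind: Node
metadata:
  name: linux-worker-1
  labels:
    kubernetes.io/os: linux
    kubernetes.io/arch: amd64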
kubectl -n starboard-operator get pods
NAME                                        READY   STATUS              RESTARTS   AGE
scan-vulnerabilityreport-5d478ff6c5-m9nn9   0/1     Completed           0          24h
scan-vulnerabilityreport-5f655c6dd9-jjjsj   0/1     ContainerCreating   0          2h
scan-vulnerabilityreport-6b8d6989fb-9pc5c   0/3     ContainerCreating   0          2h
scan-vulnerabilityreport-6cfc49d9f5-bsr9w   0/1     ContainerCreating   0          2h
We set the nodeSelector in the Helm chart values, but it's only used for the operator itself and not for the pods scheduled by the operator. Is it possible to add this to the operator?
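For illustration, such a top-level chart value presumably looks like this and, as described above, is applied to the operator Deployment only (the key layout is an assumption and may differ between chart versions):

# values.yaml excerpt (sketch); rendered into the operator Deployment,
# not into the scan jobs the operator creates.
nodeSelector:
  kubernetes.io/os: linux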
What did you expect to happen:
Pods created by the operator should honor a nodeSelector so that they run only on Linux workers.
Environment:
- Starboard version (use starboard version): Helm chart version 0.7.0
- Kubernetes version (use kubectl version): EKS 1.20
- OS (macOS 10.15, Windows 10, Ubuntu 19.10 etc.):
@krol3 Can you have a look at this one? I remember some time back we added node labels and affinities to handle Windows and Linux nodes. Is there anything missing or is it a regression?
@danielpacak I will take a look. I remember testing this using the Starboard CLI client.
👋 @dschunack I think it would be pretty easy to extend the Starboard configuration to allow setting custom nodeSelectors for scan pods (see the sketch after the pod manifest below). However, before we do that, I'm wondering why pods end up on Windows workers even though we set this nodeAffinity for each vulnerability scan job:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2021-10-22T08:00:49Z"
  generateName: scan-vulnerabilityreport-74ddf5fb6-
  labels:
    app.kubernetes.io/managed-by: starboard
    controller-uid: 872baf5a-f157-4160-b79d-9d1ac76661d1
    job-name: scan-vulnerabilityreport-74ddf5fb6
    resource-spec-hash: 74777446d5
    starboard.resource.kind: Deployment
    starboard.resource.name: nginx
    starboard.resource.namespace: default
    vulnerabilityReport.scanner: "true"
  name: scan-vulnerabilityreport-74ddf5fb6-swb9z
  namespace: starboard
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: scan-vulnerabilityreport-74ddf5fb6
    uid: 872baf5a-f157-4160-b79d-9d1ac76661d1
  resourceVersion: "350275"
  uid: 119bf51f-b129-4327-b491-90fb2c78573f
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
  automountServiceAccountToken: false
  # [trimmed output...]
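For illustration, the configuration extension mentioned above could look something like the following; the scanJob.nodeSelector key is purely hypothetical and not an existing Starboard setting:

# Hypothetical sketch only: a JSON-encoded nodeSelector for scan jobs,
# read by the operator from its ConfigMap. The key name and encoding
# are assumptions, not an existing Starboard API.
apiVersion: v1
kind: ConfigMap
metadata:
  name: starboard
  namespace: starboard-operator
data:
  scanJob.nodeSelector: '{"kubernetes.io/os": "linux"}'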
Hi,
nodeAffinity would also be fine for us, but at the moment it's not set on the scan pods.
Is nodeAffinity controlled by an extra env variable? Maybe something is missing in our ConfigMap or deployment of the Starboard operator.
We use Starboard operator 0.12.0.
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2021-10-22T15:03:36Z"
  generateName: scan-vulnerabilityreport-5f655c6dd9-
  labels:
    app.kubernetes.io/managed-by: starboard
    controller-uid: be42db70-a7f2-4318-b377-24563fdc4457
    job-name: scan-vulnerabilityreport-5f655c6dd9
    resource-spec-hash: 968f9cf69
    starboard.resource.kind: ReplicaSet
    starboard.resource.name: vpc-resource-controller-6677fc869b
    starboard.resource.namespace: kube-system
    vulnerabilityReport.scanner: "true"
  name: scan-vulnerabilityreport-5f655c6dd9-5j5tn
  namespace: starboard-operator
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: scan-vulnerabilityreport-5f655c6dd9
    uid: be42db70-a7f2-4318-b377-24563fdc4457
  resourceVersion: "41225140"
  uid: 3fcb2f89-7d8c-4e21-980c-85c80d654545
spec:
  automountServiceAccountToken: false
  containers:
  - args:
    - --quiet
    - client
Hmm, looking at the code of v0.12.0, we set nodeAffinity in the Trivy plugin only in Standalone mode. Do you happen to configure the Trivy plugin in ClientServer mode? That would explain the mystery...
Yes, we use Starboard in ClientServer mode, connected to a central Trivy server.
Got it. Okay, so the fix would be a one-liner, similar to what we do in Standalone mode: https://github.com/aquasecurity/starboard/blob/v0.12.0/pkg/plugin/trivy/plugin.go#L558
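Once that line is mirrored in the ClientServer branch, scan pods created in ClientServer mode should carry the same hard scheduling constraint as in Standalone mode, i.e. a rendered pod spec like this (sketch, taken from the affinity block shown earlier):

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux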
Hi @danielpacak,
I tested the new starboard-operator version 0.13.0-rc1, but the nodeAffinity is still not set.
kubectl -n starboard-operator get cm starboard-trivy-config -o yaml
apiVersion: v1
data:
  trivy.ignoreUnfixed: "true"
  trivy.imageRef: docker.io/aquasec/trivy:0.19.2
  trivy.mode: ClientServer
  trivy.resources.limits.cpu: 500m
  trivy.resources.limits.memory: 500M
  trivy.resources.requests.cpu: 100m
  trivy.resources.requests.memory: 100M
  trivy.serverURL: http://xxxxxx:8080
  trivy.severity: MEDIUM,HIGH,CRITICAL
kind: ConfigMap
kubectl -n starboard-operator get pod scan-vulnerabilityreport-76cc4759bc-7hdqb -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2021-10-27T15:09:05Z"
  generateName: scan-vulnerabilityreport-76cc4759bc-
  labels:
    app.kubernetes.io/managed-by: starboard
    controller-uid: 81241fa9-0fd1-45bc-a263-88582f57b68c
    job-name: scan-vulnerabilityreport-76cc4759bc
    resource-spec-hash: 76fb94fb5
    starboard.resource.kind: ReplicaSet
    starboard.resource.name: starboard-operator-647499bfb7
    starboard.resource.namespace: starboard-operator
    vulnerabilityReport.scanner: "true"
  name: scan-vulnerabilityreport-76cc4759bc-7hdqb
  namespace: starboard-operator
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: scan-vulnerabilityreport-76cc4759bc
    uid: 81241fa9-0fd1-45bc-a263-88582f57b68c
  resourceVersion: "43393052"
  uid: 31299ac4-b5f9-41ac-a148-65d54dc33a4c
spec:
  automountServiceAccountToken: false
  containers:
  - args:
    - --quiet
    - client
    - --format
    - json
    - --remote
    - http://XXXXX:8080
    - public.ecr.aws/aquasecurity/starboard-operator:0.13.0-rc1
    command:
Sorry to hear that, @dschunack. Let me check again ...
@dschunack Can you double-check and share the whole scan pod descriptor? In my env the node affinity is there:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2021-10-28T06:52:20Z"
  generateName: scan-vulnerabilityreport-866469b84d-
  labels:
    app.kubernetes.io/managed-by: starboard
    controller-uid: 658e40f6-e0f9-4d4b-b493-1abf0f516ef6
    job-name: scan-vulnerabilityreport-866469b84d
    resource-spec-hash: 7cb64cb677
    starboard.resource.kind: ReplicaSet
    starboard.resource.name: nginx-6d4cf56db6
    starboard.resource.namespace: default
    vulnerabilityReport.scanner: Trivy
  name: scan-vulnerabilityreport-866469b84d-mwhpr
  namespace: starboard
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: scan-vulnerabilityreport-866469b84d
    uid: 658e40f6-e0f9-4d4b-b493-1abf0f516ef6
  resourceVersion: "13628"
  uid: 95d13eef-9b28-4903-917c-1ea6f77ba5c8
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
  automountServiceAccountToken: false
  containers:
  - args:
    - --quiet
    - client
    - --format
    - json
    - --remote
    - http://trivy.trivy:8928
    - nginx:1.16
    command:
    - trivy
    env:
    - name: HTTP_PROXY
      valueFrom:
        configMapKeyRef:
          key: trivy.httpProxy
          name: starboard-trivy-config
          optional: true
    - name: HTTPS_PROXY
      valueFrom:
        configMapKeyRef:
          key: trivy.httpsProxy
          name: starboard-trivy-config
          optional: true
    - name: NO_PROXY
      valueFrom:
        configMapKeyRef:
          key: trivy.noProxy
          name: starboard-trivy-config
          optional: true
    - name: TRIVY_SEVERITY
      valueFrom:
        configMapKeyRef:
          key: trivy.severity
          name: starboard-trivy-config
          optional: true
    - name: TRIVY_IGNORE_UNFIXED
      valueFrom:
        configMapKeyRef:
          key: trivy.ignoreUnfixed
          name: starboard-trivy-config
          optional: true
    - name: TRIVY_SKIP_FILES
      valueFrom:
        configMapKeyRef:
          key: trivy.skipFiles
          name: starboard-trivy-config
          optional: true
    - name: TRIVY_SKIP_DIRS
      valueFrom:
        configMapKeyRef:
          key: trivy.skipDirs
          name: starboard-trivy-config
          optional: true
    - name: TRIVY_TOKEN_HEADER
      valueFrom:
        configMapKeyRef:
          key: trivy.serverTokenHeader
          name: starboard-trivy-config
          optional: true
    - name: TRIVY_TOKEN
      valueFrom:
        secretKeyRef:
          key: trivy.serverToken
          name: starboard-trivy-config
          optional: true
    - name: TRIVY_CUSTOM_HEADERS
      valueFrom:
        secretKeyRef:
          key: trivy.serverCustomHeaders
          name: starboard-trivy-config
          optional: true
    image: docker.io/aquasec/trivy:0.20.0
    imagePullPolicy: IfNotPresent
    name: nginx
    resources:
      limits:
        cpu: 500m
        memory: 500M
      requests:
        cpu: 100m
        memory: 100M
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: kind-control-plane
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: starboard
  serviceAccountName: starboard
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-10-28T06:52:20Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-10-28T06:52:23Z"
    message: 'containers with unready status: [nginx]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-10-28T06:52:23Z"
    message: 'containers with unready status: [nginx]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-10-28T06:52:20Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://b48b478196ad1c00cdafc31c3b57b2d05ab7014be2ab621ca909a7195b529542
    image: docker.io/aquasec/trivy:0.20.0
    imageID: docker.io/aquasec/trivy@sha256:76d47e5917c583fcad5ab4f83a23cb5e534c34649a994c73722fe6dfd86f2855
    lastState: {}
    name: nginx
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://b48b478196ad1c00cdafc31c3b57b2d05ab7014be2ab621ca909a7195b529542
        exitCode: 1
        finishedAt: "2021-10-28T06:52:23Z"
        message: "2021-10-28T06:52:23.245Z\t\e[31mFATAL\e[0m\terror in image scan:
          failed analysis: unable to get missing layers: unable to fetch missing layers:
          twirp error internal: failed to do request: Post \"http://trivy.trivy:8928/twirp/trivy.cache.v1.Cache/MissingBlobs\":
          dial tcp: lookup trivy.trivy on 10.96.0.10:53: no such host\n"
        reason: Error
        startedAt: "2021-10-28T06:52:20Z"
  hostIP: 172.18.0.2
  phase: Failed
  podIP: 10.244.0.6
  podIPs:
  - ip: 10.244.0.6
  qosClass: Burstable
  startTime: "2021-10-28T06:52:20Z"
Hi,
I deleted the starboard-operator container again, and now the nodeAffinity is set on the job.
Very strange :-/
The good thing is, it works now :-D