Add nodeSelector for Scan Jobs
dschunack opened this issue · comments
What steps did you take and what happened:
At the moment it is not possible to add a nodeSelector so that the scan jobs run only on Linux workers.
We also have Windows workers, and without a nodeSelector like this:

nodeSelector:
  kubernetes.io/os: linux

the pods are placed on a Windows worker node instead of a Linux worker. This fails, and the pods hang in the ContainerCreating state for hours. It's not currently possible to run Linux containers on Windows nodes, or vice versa.
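For context, that selector matches the well-known OS label which the kubelet sets on every node; a sketch of a Linux worker's metadata (the node name and arch value are illustrative):

# Sketch: kubernetes.io/os is a well-known label set by the kubelet;
# it is "windows" on Windows workers. Name and arch are placeholders.
apiVersion: v1
kind: Node
metadata:
  name: linux-worker-1
  labels:
    kubernetes.io/os: linux
    kubernetes.io/arch: amd64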
kubectl -n starboard-operator get pods
NAME                                        READY   STATUS              RESTARTS   AGE
scan-vulnerabilityreport-5d478ff6c5-m9nn9   0/1     Completed           0          24h
scan-vulnerabilityreport-5f655c6dd9-jjjsj   0/1     ContainerCreating   0          2h
scan-vulnerabilityreport-6b8d6989fb-9pc5c   0/3     ContainerCreating   0          2h
scan-vulnerabilityreport-6cfc49d9f5-bsr9w   0/1     ContainerCreating   0          2h
We set the nodeSelector in the Helm chart values, but it's only used for the operator itself and not for the pods scheduled by the operator. Is it possible to add this to the operator?
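For illustration, such a top-level chart value presumably looks like this and, as described above, is applied to the operator Deployment only (the key layout is an assumption and may differ between chart versions):

# values.yaml excerpt (sketch); rendered into the operator Deployment,
# not into the scan jobs the operator creates.
nodeSelector:
  kubernetes.io/os: linux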
What did you expect to happen:
Pods created by the operator should honor a nodeSelector so that they run only on Linux workers.
Environment:
- Starboard version (use starboard version): Helm chart version 0.7.0
- Kubernetes version (use kubectl version): EKS 1.20
- OS (macOS 10.15, Windows 10, Ubuntu 19.10 etc.):
@krol3 Can you have a look at this one? I remember some time back we added node labels and affinities to handle Windows and Linux nodes. Is there anything missing or is it a regression?
@danielpacak I will take a look. I remember testing this using the Starboard CLI client.
👋 @dschunack I think it would be pretty easy to extend the Starboard configuration to allow setting custom nodeSelectors for scan pods (see the sketch after the pod manifest below). However, before we do that, I'm wondering why pods end up on Windows workers even though we set this nodeAffinity for each vulnerability scan job:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2021-10-22T08:00:49Z"
  generateName: scan-vulnerabilityreport-74ddf5fb6-
  labels:
    app.kubernetes.io/managed-by: starboard
    controller-uid: 872baf5a-f157-4160-b79d-9d1ac76661d1
    job-name: scan-vulnerabilityreport-74ddf5fb6
    resource-spec-hash: 74777446d5
    starboard.resource.kind: Deployment
    starboard.resource.name: nginx
    starboard.resource.namespace: default
    vulnerabilityReport.scanner: "true"
  name: scan-vulnerabilityreport-74ddf5fb6-swb9z
  namespace: starboard
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: scan-vulnerabilityreport-74ddf5fb6
    uid: 872baf5a-f157-4160-b79d-9d1ac76661d1
  resourceVersion: "350275"
  uid: 119bf51f-b129-4327-b491-90fb2c78573f
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
  automountServiceAccountToken: false
  # [trimmed output...]
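For illustration, the configuration extension mentioned above could look something like the following; the scanJob.nodeSelector key is purely hypothetical and not an existing Starboard setting:

# Hypothetical sketch only: a JSON-encoded nodeSelector for scan jobs,
# read by the operator from its ConfigMap. The key name and encoding
# are assumptions, not an existing Starboard API.
apiVersion: v1
kind: ConfigMap
metadata:
  name: starboard
  namespace: starboard-operator
data:
  scanJob.nodeSelector: '{"kubernetes.io/os": "linux"}'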
Hi,
nodeAffinity would also be fine for us, but at the moment it's not set on the scan pods.
Is nodeAffinity controlled by an extra env variable? Maybe something is missing in our ConfigMap or deployment of the Starboard operator.
We use Starboard operator 0.12.0.
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2021-10-22T15:03:36Z"
  generateName: scan-vulnerabilityreport-5f655c6dd9-
  labels:
    app.kubernetes.io/managed-by: starboard
    controller-uid: be42db70-a7f2-4318-b377-24563fdc4457
    job-name: scan-vulnerabilityreport-5f655c6dd9
    resource-spec-hash: 968f9cf69
    starboard.resource.kind: ReplicaSet
    starboard.resource.name: vpc-resource-controller-6677fc869b
    starboard.resource.namespace: kube-system
    vulnerabilityReport.scanner: "true"
  name: scan-vulnerabilityreport-5f655c6dd9-5j5tn
  namespace: starboard-operator
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: scan-vulnerabilityreport-5f655c6dd9
    uid: be42db70-a7f2-4318-b377-24563fdc4457
  resourceVersion: "41225140"
  uid: 3fcb2f89-7d8c-4e21-980c-85c80d654545
spec:
  automountServiceAccountToken: false
  containers:
  - args:
    - --quiet
    - client
Hmm, looking at the code of v0.12.0, we set nodeAffinity in the Trivy plugin only in Standalone mode. Do you happen to configure the Trivy plugin in ClientServer mode? That would explain the mystery...
Yes, we use Starboard in ClientServer mode, connected to a central Trivy server.
Got it. Okay, so the fix would be a one-liner, similar to what we do in Standalone mode: https://github.com/aquasecurity/starboard/blob/v0.12.0/pkg/plugin/trivy/plugin.go#L558
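Once that line is mirrored in the ClientServer branch, scan pods created in ClientServer mode should carry the same hard scheduling constraint as in Standalone mode, i.e. a rendered pod spec like this (sketch, taken from the affinity block shown earlier):

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux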
Hi @danielpacak,
I tested the new starboard-operator version 0.13.0-rc1, but the nodeAffinity is still not set.
kubectl -n starboard-operator get cm starboard-trivy-config -o yaml
apiVersion: v1
data:
  trivy.ignoreUnfixed: "true"
  trivy.imageRef: docker.io/aquasec/trivy:0.19.2
  trivy.mode: ClientServer
  trivy.resources.limits.cpu: 500m
  trivy.resources.limits.memory: 500M
  trivy.resources.requests.cpu: 100m
  trivy.resources.requests.memory: 100M
  trivy.serverURL: http://xxxxxx:8080
  trivy.severity: MEDIUM,HIGH,CRITICAL
kind: ConfigMap
kubectl -n starboard-operator get pod scan-vulnerabilityreport-76cc4759bc-7hdqb -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2021-10-27T15:09:05Z"
  generateName: scan-vulnerabilityreport-76cc4759bc-
  labels:
    app.kubernetes.io/managed-by: starboard
    controller-uid: 81241fa9-0fd1-45bc-a263-88582f57b68c
    job-name: scan-vulnerabilityreport-76cc4759bc
    resource-spec-hash: 76fb94fb5
    starboard.resource.kind: ReplicaSet
    starboard.resource.name: starboard-operator-647499bfb7
    starboard.resource.namespace: starboard-operator
    vulnerabilityReport.scanner: "true"
  name: scan-vulnerabilityreport-76cc4759bc-7hdqb
  namespace: starboard-operator
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: scan-vulnerabilityreport-76cc4759bc
    uid: 81241fa9-0fd1-45bc-a263-88582f57b68c
  resourceVersion: "43393052"
  uid: 31299ac4-b5f9-41ac-a148-65d54dc33a4c
spec:
  automountServiceAccountToken: false
  containers:
  - args:
    - --quiet
    - client
    - --format
    - json
    - --remote
    - http://XXXXX:8080
    - public.ecr.aws/aquasecurity/starboard-operator:0.13.0-rc1
    command:
Sorry to hear that, @dschunack. Let me check again ...
@dschunack Can you double-check and share the whole scan pod descriptor? In my env the node affinity is there:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2021-10-28T06:52:20Z"
  generateName: scan-vulnerabilityreport-866469b84d-
  labels:
    app.kubernetes.io/managed-by: starboard
    controller-uid: 658e40f6-e0f9-4d4b-b493-1abf0f516ef6
    job-name: scan-vulnerabilityreport-866469b84d
    resource-spec-hash: 7cb64cb677
    starboard.resource.kind: ReplicaSet
    starboard.resource.name: nginx-6d4cf56db6
    starboard.resource.namespace: default
    vulnerabilityReport.scanner: Trivy
  name: scan-vulnerabilityreport-866469b84d-mwhpr
  namespace: starboard
  ownerReferences:
  - apiVersion: batch/v1
    blockOwnerDeletion: true
    controller: true
    kind: Job
    name: scan-vulnerabilityreport-866469b84d
    uid: 658e40f6-e0f9-4d4b-b493-1abf0f516ef6
  resourceVersion: "13628"
  uid: 95d13eef-9b28-4903-917c-1ea6f77ba5c8
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
  automountServiceAccountToken: false
  containers:
  - args:
    - --quiet
    - client
    - --format
    - json
    - --remote
    - http://trivy.trivy:8928
    - nginx:1.16
    command:
    - trivy
    env:
    - name: HTTP_PROXY
      valueFrom:
        configMapKeyRef:
          key: trivy.httpProxy
          name: starboard-trivy-config
          optional: true
    - name: HTTPS_PROXY
      valueFrom:
        configMapKeyRef:
          key: trivy.httpsProxy
          name: starboard-trivy-config
          optional: true
    - name: NO_PROXY
      valueFrom:
        configMapKeyRef:
          key: trivy.noProxy
          name: starboard-trivy-config
          optional: true
    - name: TRIVY_SEVERITY
      valueFrom:
        configMapKeyRef:
          key: trivy.severity
          name: starboard-trivy-config
          optional: true
    - name: TRIVY_IGNORE_UNFIXED
      valueFrom:
        configMapKeyRef:
          key: trivy.ignoreUnfixed
          name: starboard-trivy-config
          optional: true
    - name: TRIVY_SKIP_FILES
      valueFrom:
        configMapKeyRef:
          key: trivy.skipFiles
          name: starboard-trivy-config
          optional: true
    - name: TRIVY_SKIP_DIRS
      valueFrom:
        configMapKeyRef:
          key: trivy.skipDirs
          name: starboard-trivy-config
          optional: true
    - name: TRIVY_TOKEN_HEADER
      valueFrom:
        configMapKeyRef:
          key: trivy.serverTokenHeader
          name: starboard-trivy-config
          optional: true
    - name: TRIVY_TOKEN
      valueFrom:
        secretKeyRef:
          key: trivy.serverToken
          name: starboard-trivy-config
          optional: true
    - name: TRIVY_CUSTOM_HEADERS
      valueFrom:
        secretKeyRef:
          key: trivy.serverCustomHeaders
          name: starboard-trivy-config
          optional: true
    image: docker.io/aquasec/trivy:0.20.0
    imagePullPolicy: IfNotPresent
    name: nginx
    resources:
      limits:
        cpu: 500m
        memory: 500M
      requests:
        cpu: 100m
        memory: 100M
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: kind-control-plane
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: starboard
  serviceAccountName: starboard
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-10-28T06:52:20Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-10-28T06:52:23Z"
    message: 'containers with unready status: [nginx]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-10-28T06:52:23Z"
    message: 'containers with unready status: [nginx]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-10-28T06:52:20Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://b48b478196ad1c00cdafc31c3b57b2d05ab7014be2ab621ca909a7195b529542
    image: docker.io/aquasec/trivy:0.20.0
    imageID: docker.io/aquasec/trivy@sha256:76d47e5917c583fcad5ab4f83a23cb5e534c34649a994c73722fe6dfd86f2855
    lastState: {}
    name: nginx
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://b48b478196ad1c00cdafc31c3b57b2d05ab7014be2ab621ca909a7195b529542
        exitCode: 1
        finishedAt: "2021-10-28T06:52:23Z"
        message: "2021-10-28T06:52:23.245Z\t\e[31mFATAL\e[0m\terror in image scan:
          failed analysis: unable to get missing layers: unable to fetch missing layers:
          twirp error internal: failed to do request: Post \"http://trivy.trivy:8928/twirp/trivy.cache.v1.Cache/MissingBlobs\":
          dial tcp: lookup trivy.trivy on 10.96.0.10:53: no such host\n"
        reason: Error
        startedAt: "2021-10-28T06:52:20Z"
  hostIP: 172.18.0.2
  phase: Failed
  podIP: 10.244.0.6
  podIPs:
  - ip: 10.244.0.6
  qosClass: Burstable
  startTime: "2021-10-28T06:52:20Z"
Hi,
I deleted the starboard-operator container again, and now the nodeAffinity is set on the job.
Very strange :-/
The good thing is, it works now :-D