vmware-tanzu / sonobuoy

Sonobuoy is a diagnostic tool that makes it easier to understand the state of a Kubernetes cluster by running a set of Kubernetes conformance tests and other plugins in an accessible and non-destructive manner.

Home Page: https://sonobuoy.io

Kubernetes 1.28.6 E2E tests are not running due to unschedulable master node

mehulgogri opened this issue

What steps did you take and what happened:
./sonobuoy delete --all --wait
./sonobuoy run --mode=certified-conformance --wait

time="2024-01-24T01:49:28Z" level=info msg="delete request issued" dry-run=false kind=clusterrolebindings names="[]"
time="2024-01-24T01:49:28Z" level=info msg="delete request issued" dry-run=false kind=clusterroles names="[]"
Namespace "sonobuoy" has been deleted
Deleted all ClusterRoles and ClusterRoleBindings.
All E2E namespaces deleted
time="2024-01-24T01:49:35Z" level=info msg="create request issued" name=sonobuoy namespace= resource=namespaces
time="2024-01-24T01:49:35Z" level=info msg="create request issued" name=sonobuoy-serviceaccount namespace=sonobuoy resource=serviceaccounts
time="2024-01-24T01:49:35Z" level=info msg="create request issued" name=sonobuoy-serviceaccount-sonobuoy namespace= resource=clusterrolebindings
time="2024-01-24T01:49:35Z" level=info msg="create request issued" name=sonobuoy-serviceaccount-sonobuoy namespace= resource=clusterroles
time="2024-01-24T01:49:35Z" level=info msg="create request issued" name=sonobuoy-config-cm namespace=sonobuoy resource=configmaps
time="2024-01-24T01:49:35Z" level=info msg="create request issued" name=sonobuoy-plugins-cm namespace=sonobuoy resource=configmaps
time="2024-01-24T01:49:35Z" level=info msg="create request issued" name=sonobuoy namespace=sonobuoy resource=pods
time="2024-01-24T01:49:35Z" level=info msg="create request issued" name=sonobuoy-aggregator namespace=sonobuoy resource=services
01:49:55 Waiting for the aggregator status to become Running. Currently the status is Status: Pending, Reason: ContainersNotReady, containers with unready status: [kube-sonobuoy]
01:49:55 Details of containers that are not ready:
01:49:55 kube-sonobuoy: waiting: ContainerCreating
01:49:55 
...
01:50:35          PLUGIN                                          NODE    STATUS   RESULT   PROGRESS
01:50:35             e2e                                        global   running                    
01:50:35    systemd-logs   m2-lr1-dev-vm209096.mip.storage.hpecorp.net   running                    
01:50:35    systemd-logs   m2-lr1-dev-vm209097.mip.storage.hpecorp.net   running                    
01:50:35    systemd-logs   m2-lr1-dev-vm209099.mip.storage.hpecorp.net   running                    
01:50:35 Sonobuoy is still running. Runs can take 60 minutes or more depending on cluster and plugin configuration.
...
01:51:15    systemd-logs   m2-lr1-dev-vm209096.mip.storage.hpecorp.net   complete                    
01:51:15    systemd-logs   m2-lr1-dev-vm209097.mip.storage.hpecorp.net   complete                    
01:51:15    systemd-logs   m2-lr1-dev-vm209099.mip.storage.hpecorp.net   complete                    
...
02:21:16             e2e                                        global   complete                    
02:21:16 Sonobuoy plugins have completed. Preparing results for download.
02:21:35             e2e                                        global   complete   failed           
02:21:35    systemd-logs   m2-lr1-dev-vm209096.mip.storage.hpecorp.net   complete   passed           
02:21:35    systemd-logs   m2-lr1-dev-vm209097.mip.storage.hpecorp.net   complete   passed           
02:21:35    systemd-logs   m2-lr1-dev-vm209099.mip.storage.hpecorp.net   complete   passed           
02:21:35 Sonobuoy has completed. Use `sonobuoy retrieve` to get results.
         PLUGIN     STATUS   RESULT   COUNT   PROGRESS
            e2e   complete   failed       1           
   systemd-logs   complete   passed       3           
Sonobuoy has completed. Use `sonobuoy retrieve` to get results.
Plugin: e2e
Status: failed
Total: 4
Passed: 3
Failed: 1
Skipped: 0
Failed tests:
[SynchronizedBeforeSuite]
Plugin: systemd-logs
Status: passed
Total: 3
Passed: 3
Failed: 0
Skipped: 0
Run Details:
API Server version: v1.28.6-hpe1
Node health: 3/3 (100%)
Pods health: 12/13 (92%)
Details for failed pods:
sonobuoy/sonobuoy-e2e-job-8b54e25fd7344f4a Ready:False: PodFailed: 
Errors detected in files:
Errors:
1200 podlogs/kube-system/calico-worker-v8h7b/logs/calico-node.txt
  66 podlogs/kube-system/calico-typha-5769bc9c7-lnk5k/logs/calico-typha.txt
  41 podlogs/kube-system/calico-worker-97pgj/logs/calico-node.txt
  32 podlogs/kube-system/calico-master-stghw/logs/calico-node.txt
  26 podlogs/kube-system/calico-kube-controllers-7bd799d976-x9fw7/logs/calico-kube-controllers.txt
   6 podlogs/sonobuoy/sonobuoy-e2e-job-8b54e25fd7344f4a/logs/e2e.txt
   3 podlogs/kube-system/coredns-6c45976c97-tblqq/logs/coredns.txt
   3 podlogs/kube-system/coredns-6c45976c97-vxpqf/logs/coredns.txt
Warnings:
401 podlogs/kube-system/calico-worker-v8h7b/logs/calico-node.txt
 10 podlogs/kube-system/calico-master-stghw/logs/calico-node.txt
 10 podlogs/kube-system/calico-worker-97pgj/logs/calico-node.txt
  1 podlogs/kube-system/calico-kube-controllers-7bd799d976-x9fw7/logs/calico-kube-controllers.txt
  1 podlogs/sonobuoy/sonobuoy-e2e-job-8b54e25fd7344f4a/logs/e2e.txt
  1 podlogs/sonobuoy/sonobuoy/logs/kube-sonobuoy.txt
time="2024-01-24T02:21:41Z" level=info msg="delete request issued" dry-run=false kind=namespace namespace=sonobuoy
time="2024-01-24T02:21:41Z" level=info msg="delete request issued" dry-run=false kind=clusterrolebindings names="[sonobuoy-serviceaccount-sonobuoy]"
time="2024-01-24T02:21:42Z" level=info msg="delete request issued" dry-run=false kind=clusterroles names="[sonobuoy-serviceaccount-sonobuoy]"
Namespace "sonobuoy" has status {Phase:Terminating Conditions:[{Type:NamespaceDeletionDiscoveryFailure Status:False LastTransitionTime:2024-01-24 02:21:25 +0000 UTC Reason:ResourcesDiscovered Message:All resources successfully discovered} {Type:NamespaceDeletionGroupVersionParsingFailure Status:False LastTransitionTime:2024-01-24 02:21:25 +0000 UTC Reason:ParsedGroupVersions Message:All legacy kube types successfully parsed} {Type:NamespaceDeletionContentFailure Status:False LastTransitionTime:2024-01-24 02:21:25 +0000 UTC Reason:ContentDeleted Message:All content successfully deleted, may be waiting on finalization} {Type:NamespaceContentRemaining Status:True LastTransitionTime:2024-01-24 02:21:25 +0000 UTC Reason:SomeResourcesRemain Message:Some resources are remaining: pods. has 4 resource instances} {Type:NamespaceFinalizersRemaining Status:False LastTransitionTime:2024-01-24 02:21:25 +0000 UTC Reason:ContentHasNoFinalizers Message:All content-preserving finalizers finished}]}
Namespace "sonobuoy" has been deleted
Deleted all ClusterRoles and ClusterRoleBindings.
All E2E namespaces deleted
Done conformance-test
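
One way to pull the results tarball and locate the e2e log referenced below (a minimal sketch, assuming the ./sonobuoy binary in the working directory; exact paths inside the tarball can differ by Sonobuoy version, so find is used instead of a hard-coded path):

results=$(./sonobuoy retrieve)                       # retrieve prints the name of the results tarball
mkdir -p ./results && tar -xzf "$results" -C ./results
find ./results -name 'e2e.txt' -o -name 'e2e.log'    # locate the e2e plugin log
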
I see the following error in the e2e.txt file:
Jan 24 02:58:04.888: INFO: >>> kubeConfig: /tmp/kubeconfig-1910277190
Jan 24 02:58:04.890: INFO: Waiting up to 30m0s for all (but 0) nodes to be schedulable
  E0124 02:58:04.897527      23 progress.go:96] Failed to post progress update to http://localhost:8099/progress: Post "http://localhost:8099/progress": dial tcp 127.0.0.1:8099: connect: connection refused
  E0124 02:58:04.897635      23 progress.go:96] Failed to post progress update to http://localhost:8099/progress: Post "http://localhost:8099/progress": dial tcp 127.0.0.1:8099: connect: connection refused
Jan 24 02:58:04.901: INFO: Unschedulable nodes= 1, maximum value for starting tests= 0
Jan 24 02:58:04.901: INFO: 	-> Node m2-lr1-dev-vm209096.mip.storage.hpecorp.net [[[ Ready=true, Network(available)=true, Taints=[{node-role.kubernetes.io/master  NoSchedule <nil>}], NonblockingTaints=node-role.kubernetes.io/control-plane ]]]
Jan 24 02:58:04.901: INFO: ==== node wait: 2 out of 3 nodes are ready, max notReady allowed 0.  Need 1 more before starting.
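
The NoSchedule taint on the master node is what the suite is waiting on. A quick sketch for confirming the taints on that node, using the node name from the log above:

kubectl get nodes
kubectl describe node m2-lr1-dev-vm209096.mip.storage.hpecorp.net | grep -i taints
# based on the log above, this should show node-role.kubernetes.io/master:NoSchedule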

What did you expect to happen:
I expected the e2e tests to run.

Anything else you would like to add:

Environment:

  • Sonobuoy version: 0.57.1
  • Kubernetes version (kubectl version): 1.28.6
  • Kubernetes installer & version: 1.28.6
  • Cloud provider or hardware configuration: VMware VM
  • OS (e.g. from /etc/os-release): SLES15SP3
  • Sonobuoy tarball (which contains * below): sonobuoy_0.57.1_linux_amd64.tar.gz
  • Cluster - one master and two worker nodes

Attached e2e.txt file: e2e.txt

I ran the sonobuoy command with --non-blocking-taints passed through the e2e extra args, but it did not work.

./sonobuoy run --mode=certified-conformance --wait --plugin-env e2e.E2E_EXTRA_ARGS="--non-blocking-taints=node-role.kubernetes.io/master:NoSchedule"

Could someone please share an example of how to run the sonobuoy run command with --non-blocking-taints?

I was able to run the e2e tests by running the sonobuoy command as below:
./sonobuoy run --mode=certified-conformance --wait --plugin-env e2e.E2E_EXTRA_ARGS="--non-blocking-taints=node-role.kubernetes.io/master"