Can not run e2e test on a modified cluster

Question

STX5 opened this issue a year ago

What steps did you take and what happened:
I tried to run Sonobuoy in my cluster with this command:

sonobuoy run --wait --kubeconfig kubeconfig.txt

and got this result:

15:21:13             e2e                   global     failed   failed
15:21:13    systemd-logs   cn-zhangjiakou.1.1.1.1   complete   passed
15:21:13    systemd-logs   cn-zhangjiakou.1.1.1.2   complete   passed
15:21:13    systemd-logs   cn-zhangjiakou.1.1.1.3   complete   passed
15:21:13    systemd-logs   cn-zhangjiakou.1.1.1.4   complete   passed

I retrieved the results and found that the e2e plugin failed while pulling the image "registry.k8s.io/conformance:v1.18.8":

rpc error: code = unknown desc = error response from daemon: get https://asia-east1-docker.pkg.dev/v2/k8s-artifacts-prod/images/conformance/manifests/v1.18.8: dial tcp 74.125.23.82:443: i/o timeout

I assume this means my cluster cannot reach the public registry, so I pushed the images to my private registry:

$ docker push alien-registry.alibaba-inc.com/sonobuoy:v0.56.16
$ docker push alien-registry.alibaba-inc.com/conformance:v1.18.8

Then I used sonobuoy gen to create the YAML manifest manually (this is necessary because my private registry requires an ImagePullSecret):

$ kubectl create secret docker-registry regcred --docker-server=alien-registry.alibaba-inc.com --docker-username={XXX} --docker-password={XXX} --namespace=sonobuoy --kubeconfig kubeconfig.txt
secret/regcred created

$ echo '{"ImagePullSecrets":"regcred"}' > secretconfig.json
$ sonobuoy gen --config secretconfig.json --kubeconfig kubeconfig.txt \
    --kube-conformance-image alien-registry.alibaba-inc.com/conformance:v1.18.8 \
    --sonobuoy-image alien-registry.alibaba-inc.com/sonobuoy:v0.56.16 > test.yaml
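
Before applying the generated manifest, it can help to confirm that it really references the private images and the pull secret. This is a minimal sketch, assuming those strings appear verbatim in test.yaml; manifest_mentions is a hypothetical helper, not part of Sonobuoy.

```shell
# Hypothetical helper: succeed only if every given string occurs in the file.
manifest_mentions() {
  file="$1"
  shift
  for needle in "$@"; do
    # -F: match a fixed string, -q: quiet, --: the needle may start with a dash
    if ! grep -qF -- "$needle" "$file"; then
      echo "not found in $file: $needle" >&2
      return 1
    fi
  done
}

# Example call (not executed here):
#   manifest_mentions test.yaml regcred \
#     alien-registry.alibaba-inc.com/conformance:v1.18.8 \
#     alien-registry.alibaba-inc.com/sonobuoy:v0.56.16
```

A non-zero return names the first missing string on stderr, which points at the flag or config entry that did not make it into the manifest.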

Then I started the test:

$ sonobuoy run --wait --kubeconfig kubeconfig.txt -f test.yaml
...
...
         PLUGIN     STATUS   RESULT   COUNT   PROGRESS
            e2e     failed   failed       1
   systemd-logs   complete   passed       4

I inspected the cluster while Sonobuoy was running:

$ kubectl -n sonobuoy get pods --kubeconfig kubeconfig.txt
NAME                                                      READY   STATUS    RESTARTS   AGE
sonobuoy                                                  1/1     Running   0          6m49s
sonobuoy-e2e-job-0888f06407d84816                         1/2     Error     0          6m48s
sonobuoy-systemd-logs-daemon-set-22c5c7e57f2a40a5-4spsp   2/2     Running   0          6m48s
sonobuoy-systemd-logs-daemon-set-22c5c7e57f2a40a5-6jmcj   2/2     Running   0          6m48s
sonobuoy-systemd-logs-daemon-set-22c5c7e57f2a40a5-n4ptk   2/2     Running   0          6m48s
sonobuoy-systemd-logs-daemon-set-22c5c7e57f2a40a5-nbmvv   2/2     Running   0          6m48s

A closer look at the failed pod:

$ kubectl -n sonobuoy describe pod sonobuoy-e2e-job-0888f06407d84816 --kubeconfig kubeconfig.txt
...
Containers:
  e2e:
    Container ID:  docker://244457a29060c4ca0582dfde150c5cc416607003ab9854eb40dc127cdffed11f
    Image:         alien-registry.alibaba-inc.com/conformance:v1.18.8
    Image ID:      docker-pullable://alien-registry.alibaba-inc.com/conformance@sha256:15cd6405e4baaeb3d13b25a296115783bba53dacad2c9e06ef530c24c5860ff4
    Port:          <none>
    Host Port:     <none>
    Command:
      /run_e2e.sh
    State:          Terminated
      Reason:       Error
      Exit Code:    126
      Started:      Fri, 09 Jun 2023 15:13:35 +0800
      Finished:     Fri, 09 Jun 2023 15:13:35 +0800
    Ready:          False
    Restart Count:  0
...
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
...
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  8m1s  default-scheduler  Successfully assigned sonobuoy/sonobuoy-e2e-job-0888f06407d84816 to cn-zhangjiakou.172.16.0.178
  Normal  Pulled     8m    kubelet            Container image "alien-registry.alibaba-inc.com/conformance:v1.18.8" already present on machine
  Normal  Created    8m    kubelet            Created container e2e
  Normal  Started    8m    kubelet            Started container e2e
  Normal  Pulled     8m    kubelet            Container image "alien-registry.alibaba-inc.com/sonobuoy:v0.56.16" already present on machine
  Normal  Created    8m    kubelet            Created container sonobuoy-worker
  Normal  Started    8m    kubelet            Started container sonobuoy-worker
  Normal  Killing    25s   kubelet            Stopping container sonobuoy-worker

The image was pulled successfully, but the container does not seem to have started correctly.
I retrieved the results, but they contained no useful error message:

name: e2e
status: failed
meta:
  type: summary
items:
- name: 'Container e2e is in a terminated state (exit code 126) due to reason: Error: '
  status: failed
  meta:
    file: errors/global/error.json
  details:
    error: 'Container e2e is in a terminated state (exit code 126) due to reason:Error: '
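
Exit code 126 is at least a hint here. By shell convention, 126 means a command was found but could not be executed, and 127 means it was not found. The sketch below assumes the code comes from the shell that runs /run_e2e.sh, which the results file does not confirm; explain_exit_code is a hypothetical helper.

```shell
# Hypothetical helper: map a container exit code to its usual shell meaning.
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "success" ;;
    126) echo "command found but could not be executed" ;;
    127) echo "command not found" ;;
    *)
      # Codes above 128 conventionally mean "killed by signal (code - 128)"
      if [ "$code" -gt 128 ]; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-specific failure"
      fi
      ;;
  esac
}
```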

What did you expect to happen:
A more specific error message in the result file.

Environment:

Sonobuoy version:

Sonobuoy Version: v0.56.16
MinimumKubeVersion: 1.17.0
MaximumKubeVersion: 1.99.99
GitSHA:
GoVersion: go1.20.1
Platform: darwin/arm64
API Version:  v1.18.8-aliyun.1

Kubernetes version: (use kubectl version):

Server Version: v1.18.8-aliyun.1

Kubernetes installer & version: aliyun
Cloud provider or hardware configuration: aliyun
OS (e.g. from /etc/os-release):

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Sonobuoy tarball (which contains * below)

Peter Grant · Answer 1 · Wed Jun 14 2023 18:00:31 GMT+0800 (China Standard Time)

Hi @STX5,

Were you able to reproduce the failure and capture the logs from the errored container? My guess at this stage is that what you're doing is correct, although there is currently no construct for defining a private registry for the containers used by the e2e tests themselves. There is an open pull request, #1893, that addresses this.
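
One way to capture those logs while the pod still exists is to ask kubectl for the e2e container directly. This is a sketch, assuming the sonobuoy namespace and the sonobuoy-e2e-job pod-name prefix shown in the pod listing above; e2e_container_logs is a hypothetical helper.

```shell
# Hypothetical helper: print the logs of the e2e container in the e2e job pod.
e2e_container_logs() {
  kubeconfig="$1"
  # "get pods -o name" prints entries such as pod/sonobuoy-e2e-job-<id>
  pod="$(kubectl --kubeconfig "$kubeconfig" -n sonobuoy get pods -o name \
    | grep -m1 'sonobuoy-e2e-job')" || return 1
  # -c selects the e2e container; the same pod also runs sonobuoy-worker
  kubectl --kubeconfig "$kubeconfig" -n sonobuoy logs "$pod" -c e2e
}

# Example call (not executed here):
#   e2e_container_logs kubeconfig.txt
```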

Divya Rani · Answer 2 · Thu Jun 15 2023 12:58:01 GMT+0800 (China Standard Time)

If the Sonobuoy e2e pod uses images from a private registry, it should be able to pull them as long as the pull-secret instructions are followed (https://sonobuoy.io/docs/v0.19.0/pullsecrets/), but the secret must be in the same namespace as the e2e pod.

PR #1893 addresses the case where the containers spawned by the e2e process use a private registry: those containers are created in different namespaces, which have no access to the secret.
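
To check the same-namespace requirement, the secret can be looked up in the namespace where the e2e pod runs. This is a sketch using the regcred secret and sonobuoy namespace from the question; secrets created with kubectl create secret docker-registry have the type kubernetes.io/dockerconfigjson, and is_pull_secret is a hypothetical helper.

```shell
# Hypothetical helper: succeed if the named secret exists in the namespace
# and is a registry pull secret.
is_pull_secret() {
  kubeconfig="$1"
  namespace="$2"
  name="$3"
  # kubectl fails here if the secret does not exist in that namespace
  secret_type="$(kubectl --kubeconfig "$kubeconfig" -n "$namespace" \
    get secret "$name" -o jsonpath='{.type}')" || return 1
  [ "$secret_type" = "kubernetes.io/dockerconfigjson" ]
}

# Example call (not executed here):
#   is_pull_secret kubeconfig.txt sonobuoy regcred
```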

STX5 · Answer 3 · Thu Jun 15 2023 16:49:49 GMT+0800 (China Standard Time)

@franknstyle, thank you for your reply. I captured the logs from the errored container:

line 43: /gorunner: cannot execute binary file: Exec format error

The container image was pulled with the command docker pull registry.k8s.io/conformance:v1.18.8 and then pushed to the private registry. (I'm using an M1 MacBook; the cluster is x86.)

This was strange, because I had used docker inspect to confirm that I pulled the right image ("Architecture": "amd64", "Os": "linux"), and the other image, sonobuoy, worked just fine.
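
One way to rule out a platform mismatch when mirroring from an Apple Silicon machine is to request the platform explicitly and check it again before pushing. This is a sketch, not a confirmed diagnosis of the failure above; mirror_image is a hypothetical helper, and the comparison assumes a plain os/arch platform string such as linux/amd64.

```shell
# Hypothetical helper: pull an image for an explicit platform, verify it,
# then tag and push it to another registry.
mirror_image() {
  src="$1"
  dst="$2"
  platform="${3:-linux/amd64}"
  # Pull the variant for the cluster's platform, not the local machine's
  docker pull --platform "$platform" "$src" || return 1
  actual="$(docker image inspect --format '{{.Os}}/{{.Architecture}}' "$src")" || return 1
  if [ "$actual" != "$platform" ]; then
    echo "platform mismatch: wanted $platform, got $actual" >&2
    return 1
  fi
  docker tag "$src" "$dst" && docker push "$dst"
}

# Example call (not executed here):
#   mirror_image registry.k8s.io/conformance:v1.18.8 \
#     alien-registry.alibaba-inc.com/conformance:v1.18.8
```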

Anyway, I switched back to my old Intel MacBook, pushed the image again from there, and retried.

$ sonobuoy run --wait --kubeconfig kubeconfig.txt -f test.yaml
$ kubectl -n sonobuoy get pods --kubeconfig kubeconfig.txt
NAME                                                      READY   STATUS    RESTARTS   AGE
sonobuoy                                                  1/1     Running   0          24s
sonobuoy-e2e-job-fc17e4758ab743a8                         2/2     Running   0          23s
sonobuoy-systemd-logs-daemon-set-ffb20b1d02124de5-2spvc   2/2     Running   0          23s
sonobuoy-systemd-logs-daemon-set-ffb20b1d02124de5-8j5rx   2/2     Running   0          23s
sonobuoy-systemd-logs-daemon-set-ffb20b1d02124de5-kq2ns   2/2     Running   0          23s
sonobuoy-systemd-logs-daemon-set-ffb20b1d02124de5-vrq79   2/2     Running   0

It looked good, but the test was not progressing. I looked at the log in the e2e-job pod:

...
Jun 12 08:02:03.017: INFO: ================================
Jun 12 08:02:33.016: INFO: Condition Ready of node med-delay-monitor-node is false, but Node is tainted by NodeController with [{node.kubernetes.io/unreachable  NoSchedule 2021-01-28 07:52:32 +0000 UTC} {node.kubernetes.io/unreachable  NoExecute 2023-02-08 03:03:16 +0000 UTC}]. Failure
Jun 12 08:02:33.016: INFO: Unschedulable nodes:
Jun 12 08:02:33.016: INFO: -> med-delay-monitor-node Ready=false Network=false Taints=[{node.kubernetes.io/unreachable  NoSchedule 2021-01-28 07:52:32 +0000 UTC} {node.kubernetes.io/unreachable  NoExecute 2023-02-08 03:03:16 +0000 UTC}] NonblockingTaints:node-role.kubernetes.io/master
Jun 12 08:02:33.016: INFO: ================================
Jun 12 08:03:03.017: INFO: Condition Ready of node med-delay-monitor-node is false, but Node is tainted by NodeController with [{node.kubernetes.io/unreachable  NoSchedule 2021-01-28 07:52:32 +0000 UTC} {node.kubernetes.io/unreachable  NoExecute 2023-02-08 03:03:16 +0000 UTC}]. Failure
Jun 12 08:03:03.017: INFO: Unschedulable nodes:
Jun 12 08:03:03.017: INFO: -> med-delay-monitor-node Ready=false Network=false Taints=[{node.kubernetes.io/unreachable  NoSchedule 2021-01-28 07:52:32 +0000 UTC} {node.kubernetes.io/unreachable  NoExecute 2023-02-08 03:03:16 +0000 UTC}] NonblockingTaints:node-role.kubernetes.io/master
Jun 12 08:03:03.017: INFO: ================================
Jun 12 08:03:33.017: INFO: Condition Ready of node med-delay-monitor-node is false, but Node is tainted by NodeController with [{node.kubernetes.io/unreachable  NoSchedule 2021-01-28 07:52:32 +0000 UTC} {node.kubernetes.io/unreachable  NoExecute 2023-02-08 03:03:16 +0000 UTC}]. Failure
Jun 12 08:03:33.017: INFO: Unschedulable nodes:
Jun 12 08:03:33.017: INFO: -> med-delay-monitor-node Ready=false Network=false Taints=[{node.kubernetes.io/unreachable  NoSchedule 2021-01-28 07:52:32 +0000 UTC} {node.kubernetes.io/unreachable  NoExecute 2023-02-08 03:03:16 +0000 UTC}] NonblockingTaints:node-role.kubernetes.io/master
Jun 12 08:03:33.017: INFO: ================================
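
These log lines keep reporting the same node, med-delay-monitor-node, as not Ready and tainted as unreachable. The sketch below lists such nodes, assuming the default kubectl get nodes column layout (NAME first, STATUS second); not_ready_nodes is a hypothetical helper.

```shell
# Hypothetical helper: print the names of nodes whose STATUS is not exactly "Ready".
not_ready_nodes() {
  kubeconfig="$1"
  # Column 2 is STATUS, e.g. Ready, NotReady or Ready,SchedulingDisabled
  kubectl --kubeconfig "$kubeconfig" get nodes --no-headers \
    | awk '$2 != "Ready" { print $1 }'
}

# Example call (not executed here):
#   not_ready_nodes kubeconfig.txt
```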

I think @Divya063's explanation may be the point here: the e2e tests create other namespaces, where the ImagePullSecret is not available.

BTW, sonobuoy images pull does not seem to work properly with some Kubernetes versions:

$ sonobuoy images pull --kubernetes-version v1.18.8
INFO[0000] e2e image to be used: registry.k8s.io/conformance:v1.18.8
ERRO[0000] failed to gather test images from e2e image: exit status 1
ERRO[0000] unable to collect images of plugins