Not able to complete a deployment - failed during the sampleapp validation
Gl1TcH-1n-Th3-M4tR1x opened this issue · comments
Describe the bug
After running a deployment using the container image, it fails during the sampleapp validation.
To Reproduce
Steps to reproduce the behavior:
- Clone kubeinit
- Run from the container with:
podman run --rm -it -v ~/.ssh/id_rsa:/root/.ssh/id_rsa:z -v ~/.ssh/id_rsa.pub:/root/.ssh/id_rsa.pub:z -v ~/.ssh/config:/root/.ssh/config:z -v ./kubeinit/inventory:/kubeinit/kubeinit/inventory quay.io/kubeinit/kubeinit:2.0.1 -vvv --user root -e kubeinit_spec=okd-libvirt-1-3-1 -i ./kubeinit/inventory ./kubeinit/playbook.yml
- Error:
TASK [kubeinit.kubeinit.kubeinit_apps : Wait until pods are created] ******************************************************************************
task path: /root/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_apps/tasks/sampleapp.yml:35
Using module file /usr/lib/python3.9/site-packages/ansible/modules/command.py
Pipelining is enabled.
<10.0.0.253> ESTABLISH SSH CONNECTION FOR USER: root
<10.0.0.253> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i '~/.ssh/okdcluster_id_rsa' -o 'ProxyCommand=ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i ~/.ssh/okdcluster_id_rsa -W %h:%p -q root@nyctea' -o ControlPath=/root/.ansible/cp/2a631e4199 10.0.0.253 '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
<10.0.0.253> (0, b'\n{"changed": true, "stdout": "sampleapp-6684887657-dspc5 0/1 Pending 0 0s\\nsampleapp-6684887657-fr8r5 0/1 Pending 0 0s\\nsampleapp-6684887657-t69wp 0/1 Pending 0 0s\\nsampleapp-6684887657-zl59s 0/1 Pending 0 0s", "stderr": "", "rc": 0, "cmd": "set -o pipefail\\nkubectl get pods --namespace=sampleapp | grep sampleapp\\n", "start": "2022-04-01 14:55:34.298003", "end": "2022-04-01 14:55:34.371286", "delta": "0:00:00.073283", "msg": "", "invocation": {"module_args": {"executable": "/bin/bash", "_raw_params": "set -o pipefail\\nkubectl get pods --namespace=sampleapp | grep sampleapp\\n", "_uses_shell": true, "warn": false, "stdin_add_newline": true, "strip_empty_ends": true, "argv": null, "chdir": null, "creates": null, "removes": null, "stdin": null}}}\n', b'')
changed: [localhost -> service(10.0.0.253)] => {
"attempts": 1,
"changed": true,
"cmd": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep sampleapp\n",
"delta": "0:00:00.073283",
"end": "2022-04-01 14:55:34.371286",
"invocation": {
"module_args": {
"_raw_params": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep sampleapp\n",
"_uses_shell": true,
"argv": null,
"chdir": null,
"creates": null,
"executable": "/bin/bash",
"removes": null,
"stdin": null,
"stdin_add_newline": true,
"strip_empty_ends": true,
"warn": false
}
},
"msg": "",
"rc": 0,
"start": "2022-04-01 14:55:34.298003",
"stderr": "",
"stderr_lines": [],
"stdout": "sampleapp-6684887657-dspc5 0/1 Pending 0 0s\nsampleapp-6684887657-fr8r5 0/1 Pending 0 0s\nsampleapp-6684887657-t69wp 0/1 Pending 0 0s\nsampleapp-6684887657-zl59s 0/1 Pending 0 0s",
"stdout_lines": [
"sampleapp-6684887657-dspc5 0/1 Pending 0 0s",
"sampleapp-6684887657-fr8r5 0/1 Pending 0 0s",
"sampleapp-6684887657-t69wp 0/1 Pending 0 0s",
"sampleapp-6684887657-zl59s 0/1 Pending 0 0s"
]
}
TASK [kubeinit.kubeinit.kubeinit_apps : Wait until pods are running] ******************************************************************************
task path: /root/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_apps/tasks/sampleapp.yml:48
Using module file /usr/lib/python3.9/site-packages/ansible/modules/command.py
Pipelining is enabled.
<10.0.0.253> ESTABLISH SSH CONNECTION FOR USER: root
<10.0.0.253> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i '~/.ssh/okdcluster_id_rsa' -o 'ProxyCommand=ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i ~/.ssh/okdcluster_id_rsa -W %h:%p -q root@nyctea' -o ControlPath=/root/.ansible/cp/2a631e4199 10.0.0.253 '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
<10.0.0.253> (1, b'\n{"changed": true, "stdout": "", "stderr": "", "rc": 1, "cmd": "set -o pipefail\\nkubectl get pods --namespace=sampleapp | grep Running\\n", "start": "2022-04-01 14:55:34.527107", "end": "2022-04-01 14:55:34.597990", "delta": "0:00:00.070883", "failed": true, "msg": "non-zero return code", "invocation": {"module_args": {"executable": "/bin/bash", "_raw_params": "set -o pipefail\\nkubectl get pods --namespace=sampleapp | grep Running\\n", "_uses_shell": true, "warn": false, "stdin_add_newline": true, "strip_empty_ends": true, "argv": null, "chdir": null, "creates": null, "removes": null, "stdin": null}}}\n', b'')
<10.0.0.253> Failed to connect to the host via ssh:
FAILED - RETRYING: [localhost -> service]: Wait until pods are running (60 retries left).Result was: {
"attempts": 1,
"changed": false,
"cmd": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep Running\n",
"delta": "0:00:00.070883",
"end": "2022-04-01 14:55:34.597990",
"invocation": {
"module_args": {
"_raw_params": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep Running\n",
"_uses_shell": true,
"argv": null,
"chdir": null,
"creates": null,
"executable": "/bin/bash",
"removes": null,
"stdin": null,
"stdin_add_newline": true,
"strip_empty_ends": true,
"warn": false
}
},
"msg": "non-zero return code",
"rc": 1,
"retries": 61,
"start": "2022-04-01 14:55:34.527107",
"stderr": "",
"stderr_lines": [],
"stdout": "",
"stdout_lines": []
< 59 attempts later>
fatal: [localhost -> service(10.0.0.253)]: FAILED! => {
"attempts": 60,
"changed": false,
"cmd": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep Running\n",
"delta": "0:00:00.071734",
"end": "2022-04-01 15:00:52.670565",
"invocation": {
"module_args": {
"_raw_params": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep Running\n",
"_uses_shell": true,
"argv": null,
"chdir": null,
"creates": null,
"executable": "/bin/bash",
"removes": null,
"stdin": null,
"stdin_add_newline": true,
"strip_empty_ends": true,
"warn": false
}
},
"msg": "non-zero return code",
"rc": 1,
"start": "2022-04-01 15:00:52.598831",
"stderr": "",
"stderr_lines": [],
"stdout": "",
"stdout_lines": []
}
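The failing task is a standard Ansible retries/until loop: it re-runs `kubectl get pods --namespace=sampleapp | grep Running` up to 60 times and fails once the retries are exhausted with no pod in the Running state. A minimal sketch of the same polling pattern in Python (hypothetical helper, not kubeinit's actual code):

```python
import time

def wait_until(check, retries=60, delay=5):
    """Re-run `check` until it returns truthy or retries are
    exhausted, mirroring Ansible's retries/until behaviour."""
    for attempt in range(1, retries + 1):
        result = check()
        if result:
            return result
        if attempt < retries:
            time.sleep(delay)
    raise TimeoutError(f"check still failing after {retries} attempts")

# Example: a check that starts succeeding on the third attempt
calls = {"n": 0}
def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3

assert wait_until(fake_check, retries=5, delay=0) is True
```

Because the pods here never leave Pending (they can't pull the image), the loop exhausts all 60 attempts and the play fails, exactly as shown in the log above.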
- Troubleshooting
oc get deployments -n sampleapp
NAME READY UP-TO-DATE AVAILABLE AGE
sampleapp 0/4 4 0 3h32m
sh-4.4# oc describe pod sampleapp-6684887657-vqnfw -n sampleapp
Name: sampleapp-6684887657-vqnfw
Namespace: sampleapp
Priority: 0
Node: worker1/10.0.0.3
Start Time: Fri, 01 Apr 2022 18:30:23 +0000
Labels: app=sampleapp
pod-template-hash=6684887657
Annotations: k8s.v1.cni.cncf.io/network-status:
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.102.0.9"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.102.0.9"
],
"default": true,
"dns": {}
}]
openshift.io/scc: restricted
Status: Pending
IP: 10.102.0.9
IPs:
IP: 10.102.0.9
Controlled By: ReplicaSet/sampleapp-6684887657
Containers:
nginx:
Container ID:
Image: quay.io/bitnami/nginx:latest
Image ID:
Port: 80/TCP
Host Port: 0/TCP
State: Waiting
Reason: ImagePullBackOff
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nll5f (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-nll5f:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m27s default-scheduler Successfully assigned sampleapp/sampleapp-6684887657-vqnfw to worker1
Normal AddedInterface 8m25s multus Add eth0 [10.102.0.9/23] from openshift-sdn
Warning Failed 7m24s kubelet Failed to pull image "quay.io/bitnami/nginx:latest": rpc error: code = Unknown desc = pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 34.225.41.113:443: connect: no route to host
Normal Pulling 6m34s (x4 over 8m25s) kubelet Pulling image "quay.io/bitnami/nginx:latest"
Warning Failed 6m28s (x3 over 8m19s) kubelet Failed to pull image "quay.io/bitnami/nginx:latest": rpc error: code = Unknown desc = pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 54.144.203.57:443: connect: no route to host
Warning Failed 6m28s (x4 over 8m19s) kubelet Error: ErrImagePull
Warning Failed 6m15s (x6 over 8m18s) kubelet Error: ImagePullBackOff
Normal BackOff 3m20s (x18 over 8m18s) kubelet Back-off pulling image "quay.io/bitnami/nginx:latest"
Expected behavior
A running OKD cluster with 1 Master and 3 Workers
Infrastructure
- Hypervisors OS: CentOS-Stream 8
- CPUs : 32 Cores
- Memory: 128 GB
- HDD: 1TB
Deployment command
podman run --rm -it -v ~/.ssh/id_rsa:/root/.ssh/id_rsa:z -v ~/.ssh/id_rsa.pub:/root/.ssh/id_rsa.pub:z -v ~/.ssh/config:/root/.ssh/config:z -v ./kubeinit/inventory:/kubeinit/kubeinit/inventory quay.io/kubeinit/kubeinit:2.0.1 -vvv --user root -e kubeinit_spec=okd-libvirt-1-3-1 -i ./kubeinit/inventory ./kubeinit/playbook.yml
Inventory file
# Common variables for the inventory
#
[all:vars]
#
# Internal variables
#
ansible_python_interpreter=/usr/bin/python3
ansible_ssh_pipelining=True
ansible_ssh_common_args='-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new'
#
# Inventory variables
#
#
# The default for the cluster name is {{ kubeinit_cluster_distro + 'cluster' }}
# You can override this by setting a specific value in kubeinit_inventory_cluster_name
# kubeinit_inventory_cluster_name=mycluster
kubeinit_inventory_cluster_domain=kubeinit.local
kubeinit_inventory_network_name=kimgtnet0
kubeinit_inventory_network=10.0.0.0/24
kubeinit_inventory_gateway_offset=-2
kubeinit_inventory_nameserver_offset=-3
kubeinit_inventory_dhcp_start_offset=1
kubeinit_inventory_dhcp_end_offset=-4
kubeinit_inventory_controller_name_pattern=controller-%02d
kubeinit_inventory_compute_name_pattern=compute-%02d
kubeinit_inventory_post_deployment_services="none"
#
# Cluster definitions
#
# The networks you will use for your kubeinit clusters. The network name will be used
# to create a libvirt network for the cluster guest vms. The network cidr will set
# the range of addresses reserved for the cluster nodes. The gateway offset will be
# used to select the gateway address within the range, a negative offset starts at the
# end of the range, so for network=10.0.0.0/24, gateway_offset=-2 will select 10.0.0.254
# and gateway_offset=1 will select 10.0.0.1 as the address. Other offset attributes
# follow the same convention.
[kubeinit_networks]
# kimgtnet0 network=10.0.0.0/24 gateway_offset=-2 nameserver_offset=-3 dhcp_start_offset=1 dhcp_end_offset=-4
# kimgtnet1 network=10.0.1.0/24 gateway_offset=-2 nameserver_offset=-3 dhcp_start_offset=1 dhcp_end_offset=-4
# The clusters you are deploying using kubeinit. If there are no clusters defined here
# then kubeinit will assume you are only using one cluster at a time and will use the
# network defined by kubeinit_inventory_network.
[kubeinit_clusters]
# cluster0 network_name=kimgtnet0
# cluster1 network_name=kimgtnet1
#
# If variables are defined in this section, they will have precedence when setting
# kubeinit_inventory_post_deployment_services and kubeinit_inventory_network_name
#
# clusterXXX network_name=kimgtnetXXX post_deployment_services="none"
# clusterYYY network_name=kimgtnetYYY post_deployment_services="none"
#
# Hosts definitions
#
# The cluster's guest machines can be distributed across multiple hosts. By default they
# will be deployed on the first hypervisor. These hypervisors are activated and used
# depending on how they are referenced in the kubeinit spec string.
[hypervisor_hosts]
hypervisor-01 ansible_host=nyctea
hypervisor-02 ansible_host=tyto
# The inventory will have one host identified as the bastion host. By default, this role will
# be assumed by the first hypervisor, which is the same behavior as the first commented out
# line. The second commented out line would set the second hypervisor to be the bastion host.
# The final commented out line would set the bastion host to be a different host that is not
# being used as a hypervisor for the guest VMs for the clusters using this inventory.
[bastion_host]
# bastion target=hypervisor-01
# bastion target=hypervisor-02
# bastion ansible_host=bastion
# The inventory will have one host identified as the ovn-central host. By default, this role
# will be assumed by the first hypervisor, which is the same behavior as the first commented
# out line. The second commented out line would set the second hypervisor to be the ovn-central
# host.
[ovn_central_host]
# ovn-central target=hypervisor-01
# ovn-central target=hypervisor-02
#
# Cluster node definitions
#
# Controller, compute, and extra nodes can be configured as virtual machines or using the
# manually provisioned baremetal machines for the deployment.
# Only use an odd-numbered configuration; that is, enable 1, 3, or 5 controller nodes
# at a time.
[controller_nodes:vars]
os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': 'coreos', 'rke': 'ubuntu'}
disk=120G
ram=25165824
vcpus=8
maxvcpus=16
type=virtual
target_order=hypervisor-01
[controller_nodes]
[compute_nodes:vars]
os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': 'coreos', 'rke': 'ubuntu'}
disk=120G
ram=25165824
vcpus=8
maxvcpus=16
type=virtual
target_order="hypervisor-02,hypervisor-01"
[compute_nodes]
[extra_nodes:vars]
os={'cdk': 'ubuntu', 'okd': 'coreos'}
disk=20G
ram={'cdk': '8388608', 'okd': '16777216'}
vcpus=8
maxvcpus=16
type=virtual
target_order="hypervisor-02,hypervisor-01"
[extra_nodes]
juju-controller distro=cdk
bootstrap distro=okd
# Service nodes are a set of service containers sharing the same pod network.
# There is an implicit 'provision' service container which will use a base os
# container image based upon the service_nodes:vars os attribute.
[service_nodes:vars]
os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': 'centos', 'rke': 'ubuntu'}
target_order=hypervisor-01
[service_nodes]
service services="bind,dnsmasq,haproxy,apache,registry"
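The offset convention explained in the inventory comments (positive offsets count from the start of the network range, negative offsets count back from the end) can be sketched as follows. This is an illustrative sketch only, not kubeinit's actual implementation:

```python
import ipaddress

def offset_address(network: str, offset: int) -> str:
    """Resolve an offset into a CIDR range following the kubeinit
    inventory convention: a positive offset counts forward from the
    network address, a negative offset counts back from the end of
    the range (so -1 maps to the broadcast address)."""
    net = ipaddress.ip_network(network)
    if offset < 0:
        return str(net.broadcast_address + offset + 1)
    return str(net.network_address + offset)

print(offset_address("10.0.0.0/24", -2))  # 10.0.0.254 (gateway)
print(offset_address("10.0.0.0/24", -3))  # 10.0.0.253 (nameserver)
print(offset_address("10.0.0.0/24", 1))   # 10.0.0.1
```

This matches the values in this deployment: `gateway_offset=-2` gives 10.0.0.254 and `nameserver_offset=-3` gives 10.0.0.253, which is the service node address seen throughout the logs above.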
Additional context
Looks like a quay issue??
kubelet Failed to pull image "quay.io/bitnami/nginx:latest": rpc error: code = Unknown desc = pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 54.144.203.57:443: connect: no route to host
Yeah, it looks like the sampleapp Pod can't reach Quay .... but strange enough okd-cluster-provision can... any tips for troubleshooting this scenario?
Yeah, that is mostly because some things are pulled from Docker Hub (provisioning) while the infra steps and the sample app pull from Quay; maybe we could converge on pulling everything from the same source.
Hmm, it looks more like a routing issue: the worker can't reach Quay to pull the image.
kubelet Failed to pull image "quay.io/bitnami/nginx:latest": rpc error: code = Unknown desc = pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 34.225.41.113:443: connect: no route to host
After logging into worker1:
[core@worker1 ~]$ ssh test@8.8.8.8
ssh: connect to host 8.8.8.8 port 22: No route to host
[core@worker1 ~]$ ssh -p 443 quay.io
ssh: connect to host quay.io port 443: No route to host
@Gl1TcH-1n-Th3-M4tR1x I have been experiencing similar issues across all the distros... This PR fixed things for the CI: #655. Did you manage to get past that problem?
@Gl1TcH-1n-Th3-M4tR1x there was a major breakage because podman versions were not consistent across the different deployed components; after #666 I haven't reproduced this anymore.
I can successfully deploy the cluster now.