Not able to complete a deployment - failed during the sampleapp validation
Gl1TcH-1n-Th3-M4tR1x opened this issue · comments
Describe the bug
After running a deployment using the container image, it fails during the sampleapp validation.
To Reproduce
Steps to reproduce the behavior:
- Clone kubeinit
- Run from the container with:
podman run --rm -it -v ~/.ssh/id_rsa:/root/.ssh/id_rsa:z -v ~/.ssh/id_rsa.pub:/root/.ssh/id_rsa.pub:z -v ~/.ssh/config:/root/.ssh/config:z -v ./kubeinit/inventory:/kubeinit/kubeinit/inventory quay.io/kubeinit/kubeinit:2.0.1 -vvv --user root -e kubeinit_spec=okd-libvirt-1-3-1 -i ./kubeinit/inventory ./kubeinit/playbook.yml
- Error:
TASK [kubeinit.kubeinit.kubeinit_apps : Wait until pods are created] ******************************************************************************
task path: /root/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_apps/tasks/sampleapp.yml:35
Using module file /usr/lib/python3.9/site-packages/ansible/modules/command.py
Pipelining is enabled.
<10.0.0.253> ESTABLISH SSH CONNECTION FOR USER: root
<10.0.0.253> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i '~/.ssh/okdcluster_id_rsa' -o 'ProxyCommand=ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i ~/.ssh/okdcluster_id_rsa -W %h:%p -q root@nyctea' -o ControlPath=/root/.ansible/cp/2a631e4199 10.0.0.253 '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
<10.0.0.253> (0, b'\n{"changed": true, "stdout": "sampleapp-6684887657-dspc5 0/1 Pending 0 0s\\nsampleapp-6684887657-fr8r5 0/1 Pending 0 0s\\nsampleapp-6684887657-t69wp 0/1 Pending 0 0s\\nsampleapp-6684887657-zl59s 0/1 Pending 0 0s", "stderr": "", "rc": 0, "cmd": "set -o pipefail\\nkubectl get pods --namespace=sampleapp | grep sampleapp\\n", "start": "2022-04-01 14:55:34.298003", "end": "2022-04-01 14:55:34.371286", "delta": "0:00:00.073283", "msg": "", "invocation": {"module_args": {"executable": "/bin/bash", "_raw_params": "set -o pipefail\\nkubectl get pods --namespace=sampleapp | grep sampleapp\\n", "_uses_shell": true, "warn": false, "stdin_add_newline": true, "strip_empty_ends": true, "argv": null, "chdir": null, "creates": null, "removes": null, "stdin": null}}}\n', b'')
changed: [localhost -> service(10.0.0.253)] => {
"attempts": 1,
"changed": true,
"cmd": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep sampleapp\n",
"delta": "0:00:00.073283",
"end": "2022-04-01 14:55:34.371286",
"invocation": {
"module_args": {
"_raw_params": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep sampleapp\n",
"_uses_shell": true,
"argv": null,
"chdir": null,
"creates": null,
"executable": "/bin/bash",
"removes": null,
"stdin": null,
"stdin_add_newline": true,
"strip_empty_ends": true,
"warn": false
}
},
"msg": "",
"rc": 0,
"start": "2022-04-01 14:55:34.298003",
"stderr": "",
"stderr_lines": [],
"stdout": "sampleapp-6684887657-dspc5 0/1 Pending 0 0s\nsampleapp-6684887657-fr8r5 0/1 Pending 0 0s\nsampleapp-6684887657-t69wp 0/1 Pending 0 0s\nsampleapp-6684887657-zl59s 0/1 Pending 0 0s",
"stdout_lines": [
"sampleapp-6684887657-dspc5 0/1 Pending 0 0s",
"sampleapp-6684887657-fr8r5 0/1 Pending 0 0s",
"sampleapp-6684887657-t69wp 0/1 Pending 0 0s",
"sampleapp-6684887657-zl59s 0/1 Pending 0 0s"
]
}
TASK [kubeinit.kubeinit.kubeinit_apps : Wait until pods are running] ******************************************************************************
task path: /root/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_apps/tasks/sampleapp.yml:48
Using module file /usr/lib/python3.9/site-packages/ansible/modules/command.py
Pipelining is enabled.
<10.0.0.253> ESTABLISH SSH CONNECTION FOR USER: root
<10.0.0.253> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="root"' -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i '~/.ssh/okdcluster_id_rsa' -o 'ProxyCommand=ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new -i ~/.ssh/okdcluster_id_rsa -W %h:%p -q root@nyctea' -o ControlPath=/root/.ansible/cp/2a631e4199 10.0.0.253 '/bin/sh -c '"'"'/usr/bin/python3 && sleep 0'"'"''
<10.0.0.253> (1, b'\n{"changed": true, "stdout": "", "stderr": "", "rc": 1, "cmd": "set -o pipefail\\nkubectl get pods --namespace=sampleapp | grep Running\\n", "start": "2022-04-01 14:55:34.527107", "end": "2022-04-01 14:55:34.597990", "delta": "0:00:00.070883", "failed": true, "msg": "non-zero return code", "invocation": {"module_args": {"executable": "/bin/bash", "_raw_params": "set -o pipefail\\nkubectl get pods --namespace=sampleapp | grep Running\\n", "_uses_shell": true, "warn": false, "stdin_add_newline": true, "strip_empty_ends": true, "argv": null, "chdir": null, "creates": null, "removes": null, "stdin": null}}}\n', b'')
<10.0.0.253> Failed to connect to the host via ssh:
FAILED - RETRYING: [localhost -> service]: Wait until pods are running (60 retries left).Result was: {
"attempts": 1,
"changed": false,
"cmd": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep Running\n",
"delta": "0:00:00.070883",
"end": "2022-04-01 14:55:34.597990",
"invocation": {
"module_args": {
"_raw_params": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep Running\n",
"_uses_shell": true,
"argv": null,
"chdir": null,
"creates": null,
"executable": "/bin/bash",
"removes": null,
"stdin": null,
"stdin_add_newline": true,
"strip_empty_ends": true,
"warn": false
}
},
"msg": "non-zero return code",
"rc": 1,
"retries": 61,
"start": "2022-04-01 14:55:34.527107",
"stderr": "",
"stderr_lines": [],
"stdout": "",
"stdout_lines": []
< 59 attempts later>
fatal: [localhost -> service(10.0.0.253)]: FAILED! => {
"attempts": 60,
"changed": false,
"cmd": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep Running\n",
"delta": "0:00:00.071734",
"end": "2022-04-01 15:00:52.670565",
"invocation": {
"module_args": {
"_raw_params": "set -o pipefail\nkubectl get pods --namespace=sampleapp | grep Running\n",
"_uses_shell": true,
"argv": null,
"chdir": null,
"creates": null,
"executable": "/bin/bash",
"removes": null,
"stdin": null,
"stdin_add_newline": true,
"strip_empty_ends": true,
"warn": false
}
},
"msg": "non-zero return code",
"rc": 1,
"start": "2022-04-01 15:00:52.598831",
"stderr": "",
"stderr_lines": [],
"stdout": "",
"stdout_lines": []
}
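The failing task is a standard Ansible retries/until loop: it re-runs `kubectl get pods --namespace=sampleapp | grep Running` up to 60 times and fails once the retries are exhausted with no pod in the Running state. A minimal sketch of the same polling pattern in Python (hypothetical helper, not kubeinit's actual code):

```python
import time

def wait_until(check, retries=60, delay=5):
    """Re-run `check` until it returns truthy or retries are
    exhausted, mirroring Ansible's retries/until behaviour."""
    for attempt in range(1, retries + 1):
        result = check()
        if result:
            return result
        if attempt < retries:
            time.sleep(delay)
    raise TimeoutError(f"check still failing after {retries} attempts")

# Example: a check that starts succeeding on the third attempt
calls = {"n": 0}
def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3

assert wait_until(fake_check, retries=5, delay=0) is True
```

Because the pods here never leave Pending (they can't pull the image), the loop exhausts all 60 attempts and the play fails, exactly as shown in the log above.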
- Troubleshooting
oc get deployments -n sampleapp
NAME READY UP-TO-DATE AVAILABLE AGE
sampleapp 0/4 4 0 3h32m
sh-4.4# oc describe pod sampleapp-6684887657-vqnfw -n sampleapp
Name: sampleapp-6684887657-vqnfw
Namespace: sampleapp
Priority: 0
Node: worker1/10.0.0.3
Start Time: Fri, 01 Apr 2022 18:30:23 +0000
Labels: app=sampleapp
pod-template-hash=6684887657
Annotations: k8s.v1.cni.cncf.io/network-status:
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.102.0.9"
],
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "openshift-sdn",
"interface": "eth0",
"ips": [
"10.102.0.9"
],
"default": true,
"dns": {}
}]
openshift.io/scc: restricted
Status: Pending
IP: 10.102.0.9
IPs:
IP: 10.102.0.9
Controlled By: ReplicaSet/sampleapp-6684887657
Containers:
nginx:
Container ID:
Image: quay.io/bitnami/nginx:latest
Image ID:
Port: 80/TCP
Host Port: 0/TCP
State: Waiting
Reason: ImagePullBackOff
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nll5f (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-nll5f:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m27s default-scheduler Successfully assigned sampleapp/sampleapp-6684887657-vqnfw to worker1
Normal AddedInterface 8m25s multus Add eth0 [10.102.0.9/23] from openshift-sdn
Warning Failed 7m24s kubelet Failed to pull image "quay.io/bitnami/nginx:latest": rpc error: code = Unknown desc = pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 34.225.41.113:443: connect: no route to host
Normal Pulling 6m34s (x4 over 8m25s) kubelet Pulling image "quay.io/bitnami/nginx:latest"
Warning Failed 6m28s (x3 over 8m19s) kubelet Failed to pull image "quay.io/bitnami/nginx:latest": rpc error: code = Unknown desc = pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 54.144.203.57:443: connect: no route to host
Warning Failed 6m28s (x4 over 8m19s) kubelet Error: ErrImagePull
Warning Failed 6m15s (x6 over 8m18s) kubelet Error: ImagePullBackOff
Normal BackOff 3m20s (x18 over 8m18s) kubelet Back-off pulling image "quay.io/bitnami/nginx:latest"
Expected behavior
A running OKD cluster with 1 Master and 3 Workers
Infrastructure
- Hypervisors OS: CentOS-Stream 8
- CPUs : 32 Cores
- Memory: 128 GB
- HDD: 1TB
Deployment command
podman run --rm -it -v ~/.ssh/id_rsa:/root/.ssh/id_rsa:z -v ~/.ssh/id_rsa.pub:/root/.ssh/id_rsa.pub:z -v ~/.ssh/config:/root/.ssh/config:z -v ./kubeinit/inventory:/kubeinit/kubeinit/inventory quay.io/kubeinit/kubeinit:2.0.1 -vvv --user root -e kubeinit_spec=okd-libvirt-1-3-1 -i ./kubeinit/inventory ./kubeinit/playbook.yml
Inventory file
# Common variables for the inventory
#
[all:vars]
#
# Internal variables
#
ansible_python_interpreter=/usr/bin/python3
ansible_ssh_pipelining=True
ansible_ssh_common_args='-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=accept-new'
#
# Inventory variables
#
#
# The default for the cluster name is {{ kubeinit_cluster_distro + 'cluster' }}
# You can override this by setting a specific value in kubeinit_inventory_cluster_name
# kubeinit_inventory_cluster_name=mycluster
kubeinit_inventory_cluster_domain=kubeinit.local
kubeinit_inventory_network_name=kimgtnet0
kubeinit_inventory_network=10.0.0.0/24
kubeinit_inventory_gateway_offset=-2
kubeinit_inventory_nameserver_offset=-3
kubeinit_inventory_dhcp_start_offset=1
kubeinit_inventory_dhcp_end_offset=-4
kubeinit_inventory_controller_name_pattern=controller-%02d
kubeinit_inventory_compute_name_pattern=compute-%02d
kubeinit_inventory_post_deployment_services="none"
#
# Cluster definitions
#
# The networks you will use for your kubeinit clusters. The network name will be used
# to create a libvirt network for the cluster guest vms. The network cidr will set
# the range of addresses reserved for the cluster nodes. The gateway offset will be
# used to select the gateway address within the range, a negative offset starts at the
# end of the range, so for network=10.0.0.0/24, gateway_offset=-2 will select 10.0.0.254
# and gateway_offset=1 will select 10.0.0.1 as the address. Other offset attributes
# follow the same convention.
[kubeinit_networks]
# kimgtnet0 network=10.0.0.0/24 gateway_offset=-2 nameserver_offset=-3 dhcp_start_offset=1 dhcp_end_offset=-4
# kimgtnet1 network=10.0.1.0/24 gateway_offset=-2 nameserver_offset=-3 dhcp_start_offset=1 dhcp_end_offset=-4
# The clusters you are deploying using kubeinit. If there are no clusters defined here
# then kubeinit will assume you are only using one cluster at a time and will use the
# network defined by kubeinit_inventory_network.
[kubeinit_clusters]
# cluster0 network_name=kimgtnet0
# cluster1 network_name=kimgtnet1
#
# If variables are defined in this section, they will have precedence when setting
# kubeinit_inventory_post_deployment_services and kubeinit_inventory_network_name
#
# clusterXXX network_name=kimgtnetXXX post_deployment_services="none"
# clusterYYY network_name=kimgtnetYYY post_deployment_services="none"
#
# Hosts definitions
#
# The cluster's guest machines can be distributed across multiple hosts. By default they
# will be deployed on the first hypervisor. These hypervisors are activated and used
# depending on how they are referenced in the kubeinit spec string.
[hypervisor_hosts]
hypervisor-01 ansible_host=nyctea
hypervisor-02 ansible_host=tyto
# The inventory will have one host identified as the bastion host. By default, this role will
# be assumed by the first hypervisor, which is the same behavior as the first commented out
# line. The second commented out line would set the second hypervisor to be the bastion host.
# The final commented out line would set the bastion host to be a different host that is not
# being used as a hypervisor for the guest VMs for the clusters using this inventory.
[bastion_host]
# bastion target=hypervisor-01
# bastion target=hypervisor-02
# bastion ansible_host=bastion
# The inventory will have one host identified as the ovn-central host. By default, this role
# will be assumed by the first hypervisor, which is the same behavior as the first commented
# out line. The second commented out line would set the second hypervisor to be the ovn-central
# host.
[ovn_central_host]
# ovn-central target=hypervisor-01
# ovn-central target=hypervisor-02
#
# Cluster node definitions
#
# Controller, compute, and extra nodes can be configured as virtual machines or using the
# manually provisioned baremetal machines for the deployment.
# Only use an odd-numbered configuration; that is, enable 1, 3, or 5 controller nodes
# at a time.
[controller_nodes:vars]
os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': 'coreos', 'rke': 'ubuntu'}
disk=120G
ram=25165824
vcpus=8
maxvcpus=16
type=virtual
target_order=hypervisor-01
[controller_nodes]
[compute_nodes:vars]
os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': 'coreos', 'rke': 'ubuntu'}
disk=120G
ram=25165824
vcpus=8
maxvcpus=16
type=virtual
target_order="hypervisor-02,hypervisor-01"
[compute_nodes]
[extra_nodes:vars]
os={'cdk': 'ubuntu', 'okd': 'coreos'}
disk=20G
ram={'cdk': '8388608', 'okd': '16777216'}
vcpus=8
maxvcpus=16
type=virtual
target_order="hypervisor-02,hypervisor-01"
[extra_nodes]
juju-controller distro=cdk
bootstrap distro=okd
# Service nodes are a set of service containers sharing the same pod network.
# There is an implicit 'provision' service container which will use a base os
# container image based upon the service_nodes:vars os attribute.
[service_nodes:vars]
os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': 'centos', 'rke': 'ubuntu'}
target_order=hypervisor-01
[service_nodes]
service services="bind,dnsmasq,haproxy,apache,registry"
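The offset convention explained in the inventory comments (positive offsets count from the start of the network range, negative offsets count back from the end) can be sketched as follows. This is an illustrative sketch only, not kubeinit's actual implementation:

```python
import ipaddress

def offset_address(network: str, offset: int) -> str:
    """Resolve an offset into a CIDR range following the kubeinit
    inventory convention: a positive offset counts forward from the
    network address, a negative offset counts back from the end of
    the range (so -1 maps to the broadcast address)."""
    net = ipaddress.ip_network(network)
    if offset < 0:
        return str(net.broadcast_address + offset + 1)
    return str(net.network_address + offset)

print(offset_address("10.0.0.0/24", -2))  # 10.0.0.254 (gateway)
print(offset_address("10.0.0.0/24", -3))  # 10.0.0.253 (nameserver)
print(offset_address("10.0.0.0/24", 1))   # 10.0.0.1
```

This matches the values in this deployment: `gateway_offset=-2` gives 10.0.0.254 and `nameserver_offset=-3` gives 10.0.0.253, which is the service node address seen throughout the logs above.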
Additional context
Looks like a quay issue??
kubelet Failed to pull image "quay.io/bitnami/nginx:latest": rpc error: code = Unknown desc = pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 54.144.203.57:443: connect: no route to host
Yeah, it looks like the sampleapp Pod can't reach Quay .... but strange enough okd-cluster-provision can... any tips for troubleshooting this scenario?
Yeah, that is mostly because some things are pulled from Docker Hub (provisioning) while the infra steps and the sample app pull from Quay; maybe we could converge on pulling everything from the same source.
Hmm, it looks more like a routing issue: the worker can't reach Quay to pull the image.
kubelet Failed to pull image "quay.io/bitnami/nginx:latest": rpc error: code = Unknown desc = pinging container registry quay.io: Get "https://quay.io/v2/": dial tcp 34.225.41.113:443: connect: no route to host
After logging into worker1:
[core@worker1 ~]$ ssh test@8.8.8.8
ssh: connect to host 8.8.8.8 port 22: No route to host
[core@worker1 ~]$ ssh -p 443 quay.io
ssh: connect to host quay.io port 443: No route to host
@Gl1TcH-1n-Th3-M4tR1x I have been experiencing similar issues across all the distros... This PR fixed things for the CI: #655. Did you manage to get past that problem?
@Gl1TcH-1n-Th3-M4tR1x there was a major breakage because podman versions were not consistent across the different deployed components; after #666 I haven't reproduced this anymore.
I can successfully deploy the cluster now.