Kubeinit / kubeinit

Ansible automation to have a KUBErnetes cluster INITialized as soon as possible...

Home Page: https://www.kubeinit.org

OKD cluster failed to deploy - "container already in use" error is seen in bootstrap

logeshwaris opened this issue · comments

commented

Describe the bug
Trying to deploy an OKD cluster with 1 master and 2 worker nodes.
While running the Ansible playbook, the controller nodes don't reach the Ready state even after 60 tries.
When I log into the bootstrap node, I see the error "container already in use". If I remove the container, the cluster comes up fine.
It doesn't throw this error every time; I've seen it at least 2 times out of 5 attempts.
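
In case it helps, this is roughly the manual workaround (a sketch of what I run on the bootstrap node; the container name is the one from the logs below):

    # Find the leftover container holding the fixed name and remove it,
    # then let the bootstrap service retry the render step.
    sudo podman ps -a --filter name=kube-apiserver-render
    sudo podman rm -f kube-apiserver-render
    sudo systemctl restart bootkube.service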

To Reproduce
Steps to reproduce the behavior:

  1. Clone kubeinit

  2. Run the command

    ansible-playbook \
    -v --user root \
    -e kubeinit_spec=okd-libvirt-1-2-1 \
    -i ./kubeinit/inventory \
    ./kubeinit/playbook.yml

  3. See error
    Bootstrap logs:
    ===============
    Apr 06 10:59:08 bootstrap podman[8121]: 2022-04-06 10:59:08.285934624 +0000 UTC m=+1.377781571 container cleanup 6581fb6d4ff11c4d91635217c4a27d80453>
    Apr 06 10:59:18 bootstrap podman[8219]: 2022-04-06 10:59:07.13491223 +0000 UTC m=+0.082398628 image pull quay.io/openshift/okd-content@sha256:be5eb>
    Apr 06 10:59:19 bootstrap podman[8219]: 2022-04-06 10:59:19.016405592 +0000 UTC m=+11.963891960 container create 92aa4efe11dc0d1e4e99c182b209b9dc6b4>
    Apr 06 10:59:19 bootstrap podman[8219]: 2022-04-06 10:59:19.641219899 +0000 UTC m=+12.588706277 container init 92aa4efe11dc0d1e4e99c182b209b9dc6b468>
    Apr 06 10:59:19 bootstrap podman[8219]: 2022-04-06 10:59:19.674477098 +0000 UTC m=+12.621963466 container start 92aa4efe11dc0d1e4e99c182b209b9dc6b46>
    Apr 06 10:59:19 bootstrap podman[8219]: 2022-04-06 10:59:19.674690723 +0000 UTC m=+12.622177121 container attach 92aa4efe11dc0d1e4e99c182b209b9dc6b4>
    Apr 06 10:59:20 bootstrap systemd[1]: Stopping Bootstrap a Kubernetes cluster...
    Apr 06 10:59:20 bootstrap bootkube.sh[9514]: open pidfd: No such process
    Apr 06 10:59:20 bootstrap bootkube.sh[8219]: time="2022-04-06T10:59:20Z" level=error msg="Error forwarding signal 15 to container 92aa4efe11dc0d1e4e>
    Apr 06 10:59:20 bootstrap bootkube.sh[2056]: Terminated
    Apr 06 10:59:20 bootstrap podman[9521]: 2022-04-06 10:59:20.28349949 +0000 UTC m=+0.040186130 container died 92aa4efe11dc0d1e4e99c182b209b9dc6b46848>

    Apr 06 10:59:20 bootstrap systemd[1]: bootkube.service: Deactivated successfully.
    Apr 06 10:59:20 bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster.
    Apr 06 10:59:20 bootstrap systemd[1]: bootkube.service: Consumed 33.650s CPU time.
    Apr 06 10:59:20 bootstrap systemd[1]: release-image.service: Deactivated successfully.
    Apr 06 10:59:20 bootstrap systemd[1]: Stopped Download the OpenShift Release Image.
    Apr 06 10:59:20 bootstrap systemd[1]: release-image.service: Consumed 12.351s CPU time.
    -- Boot c93e0d5bc8b44038b0d5d265ed467c93 --
    Apr 06 10:59:31 bootstrap systemd[1]: Starting Download the OpenShift Release Image...
    Apr 06 10:59:31 bootstrap release-image-download.sh[966]: Pulling service.okdcluster.kubeinit.local:5000/okd@sha256:7d8356245fc3a75fe11d1832ce9fef17>
    Apr 06 10:59:32 bootstrap podman[1015]: 2022-04-06 10:59:32.079196063 +0000 UTC m=+0.961207467 system refresh
    Apr 06 10:59:32 bootstrap release-image-download.sh[1015]: 5c93a0adf473e01f1bd88d3e539dbbe6de5bcfb74eace85038a63490f9603143
    Apr 06 10:59:32 bootstrap podman[1015]: 2022-04-06 10:59:32.080829538 +0000 UTC m=+0.962840932 image pull service.okdcluster.kubeinit.local:5000/ok>
    Apr 06 10:59:33 bootstrap systemd[1]: Finished Download the OpenShift Release Image.
    Apr 06 10:59:41 bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
    ...
    Apr 06 11:35:46 bootstrap podman[308085]: 2022-04-06 11:35:46.277425197 +0000 UTC m=+0.499459314 container remove d5440565f1b94e5a176c11750c60d4d45861976990b4f5f1aa56bdace09eb412 (image=service.okdcluster.kubeinit.local:5000/okd@sha256:7d8356245fc3a75fe11d1832ce9fef17f3dd0f2ea6f38271319c95918416b9d9, name=quizzical_ellis, io.openshift.release=4.9.0-0.okd-2021-11-28-035710, io.openshift.release.base-image-digest=sha256:24a6759ce7d34123ae68ee14ee2a7c52ec3b2c7a5ae65cf87651176661e55e58)
    Apr 06 11:35:46 bootstrap bootkube.sh[306030]: Rendering Kubernetes API server core manifests...
    Apr 06 11:35:46 bootstrap bootkube.sh[308213]: Error: error creating container storage: the container name "kube-apiserver-render" is already in use by "92aa4efe11dc0d1e4e99c182b209b9dc6b468483438865d8a2bcef825b22c65b". You have to remove that container to be able to reuse that name.: that name is already in use

    Apr 06 11:35:46 bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=125/n/a
    Apr 06 11:35:46 bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
    Apr 06 11:35:46 bootstrap systemd[1]: bootkube.service: Consumed 4.452s CPU time.

[core@bootstrap ~]$ sudo podman ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ed027262a5fa service.okdcluster.kubeinit.local:5000/okd@sha256:7d8356245fc3a75fe11d1832ce9fef17f3dd0f2ea6f38271319c95918416b9d9 render --output-d... 38 minutes ago Exited (0) 38 minutes ago cvo-render
95b964e69d58 quay.io/openshift/okd-content@sha256:8c24b5ca67f5cd7763dbcb1586cfcfcff2083eae137acfea6f9b0468fcd2e8e6 /usr/bin/cluster-... 37 minutes ago Exited (0) 37 minutes ago etcd-render
6581fb6d4ff1 quay.io/openshift/okd-content@sha256:5a262a1ca5b05a174286494220a1f583ed1fcb2fb60114aae25f6d2670699746 /usr/bin/cluster-... 37 minutes ago Exited (0) 37 minutes ago config-render
92aa4efe11dc quay.io/openshift/okd-content@sha256:be5eb9ef4a8c26ce7e5827285a4e65620aa7b31c9fb203e046c900a45b095764 /usr/bin/cluster-... 36 minutes ago Created kube-apiserver-render
[core@bootstrap ~]$
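
My reading of the output above (an assumption on my side, not confirmed): the node rebooted mid-render (the "-- Boot ... --" line in the journal), bootkube was restarted, and the kube-apiserver-render container from the interrupted run was left behind in Created state, so the new run cannot create a container with the same fixed name. The stuck state can be checked directly (container id taken from this run):

    # A container whose status is "created" was never started; bootkube does
    # not remove it before creating a new one with the same name.
    sudo podman inspect --format '{{.State.Status}}' 92aa4efe11dc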

Expected behavior
A running OKD cluster with 1 master and 2 worker nodes.

Infrastructure
Hypervisor OS: CentOS Stream 8
CPUs: 32 cores
Memory: 128 GB
HDD: 1 TB

Deployment command

ansible-playbook \
-v --user root \
-e kubeinit_spec=okd-libvirt-1-2-1 \
-i ./kubeinit/inventory \
./kubeinit/playbook.yml

Inventory file diff
diff --git a/kubeinit/inventory b/kubeinit/inventory
index bbb380d..d862b0e 100644
--- a/kubeinit/inventory
+++ b/kubeinit/inventory
@@ -72,8 +72,8 @@ kubeinit_inventory_network_name=kimgtnet0

[hypervisor_hosts]

hypervisor-01 ansible_host=nyctea
-hypervisor-02 ansible_host=tyto
-# hypervisor-01 ansible_host=nyctea ssh_hostname=server1.example.com
+#hypervisor-02 ansible_host=tyto
+# hypervisor-01 ansible_host=nyctea
...
[controller_nodes:vars]
os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': 'coreos', 'rke': 'ubuntu'}
-disk=25G
+disk=150G
ram=25165824
vcpus=8
maxvcpus=16
@@ -152,8 +152,8 @@ target_order=hypervisor-01

[compute_nodes:vars]
os={'cdk': 'ubuntu', 'eks': 'centos', 'k8s': 'centos', 'kid': 'debian', 'okd': 'coreos', 'rke': 'ubuntu'}
-disk=30G
-ram=8388608
+disk=100G
+ram=16777216
vcpus=8
maxvcpus=16
type=virtual

Hi @logeshwaris, from what I was able to see, the error looks like something specific to OKD rather than to the automation that deploys it. Currently we are deploying 4.9, but there are newer versions available; let me see if the problem goes away after updating the version.

Let's see how it goes here #643

commented

Hi @ccamacho, I tried using the latest version and I am seeing the error below. Am I missing something?

Command:
ansible-playbook \
-v --user root \
-e kubeinit_spec=okd-libvirt-1-2-1 \
-i ./kubeinit/inventory \
./kubeinit/playbook.yml

Logs:
TASK [kubeinit.kubeinit.kubeinit_prepare : Create ssh config file from template] **********************
task path: /home/slogeshw/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_prepare/tasks/create_host_ssh_config.yml:53
Monday 11 April 2022 11:23:31 +0530 (0:00:00.209) 0:00:16.327 **********
<127.0.0.1> ESTABLISH LOCAL CONNECTION FOR USER: slogeshw
<127.0.0.1> EXEC /bin/sh -c 'echo ~slogeshw && sleep 0'
<127.0.0.1> EXEC /bin/sh -c '( umask 77 && mkdir -p "`echo /home/slogeshw/.ansible/tmp`"&& mkdir "`echo /home/slogeshw/.ansible/tmp/ansible-tmp-1649656411.4216487-1386120-60959960057760`" && echo ansible-tmp-1649656411.4216487-1386120-60959960057760="`echo /home/slogeshw/.ansible/tmp/ansible-tmp-1649656411.4216487-1386120-60959960057760`" ) && sleep 0'
<127.0.0.1> EXEC /bin/sh -c 'rm -f -r /home/slogeshw/.ansible/tmp/ansible-tmp-1649656411.4216487-1386120-60959960057760/ > /dev/null 2>&1 && sleep 0'
The full traceback is:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/ansible/template/__init__.py", line 1100, in do_template
    res = j2_concat(rf)
  File "<template>", line 47, in root
  File "/usr/local/lib/python3.6/site-packages/jinja2/runtime.py", line 903, in _fail_with_undefined_error
    raise self._undefined_exception(self._undefined_message)
jinja2.exceptions.UndefinedError: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_host'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/ansible/plugins/action/template.py", line 146, in run
    resultant = templar.do_template(template_data, preserve_trailing_newlines=True, escape_backslashes=False)
  File "/usr/local/lib/python3.6/site-packages/ansible/template/__init__.py", line 1137, in do_template
    raise AnsibleUndefinedVariable(e)
ansible.errors.AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_host'
fatal: [localhost]: FAILED! => {
    "changed": false,
    "msg": "AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_host'"
}

PLAY RECAP ********************************************************************************************
localhost : ok=50 changed=7 unreachable=0 failed=1 skipped=23 rescued=0 ignored=0

Inventory.yml:
=============


###
# The cluster's guest machines can be distributed across multiple hosts. By default they
# will be deployed on the first hypervisor. These hypervisors are activated and used
# depending on how they are referenced in the kubeinit spec string.
#
# When we are running the setup-playbook, if a hypervisor host has an ssh_hostname attribute
# then a .ssh/config file will be created and an entry mapping the ansible_host to that
# ssh hostname will be created. In the first entry below we would associate
# the ansible_host of the first hypervisor host "nyctea" with the hostname provided; it
# can be a short or fully qualified name, but it needs to be resolvable on the host we
# are running the kubeinit setup from. The second entry uses a host IP address, which
# can be useful in those cases where the host you are using doesn't have a dns name.
#
# .. code-block:: yaml
#
#    hypervisor_hosts:
#        hypervisor-01:
#            ansible_host: nyctea
#            ssh_hostname: server1.example.com
#        hypervisor-02:
#            ansible_host: tyto
#            ssh_hostname: 192.168.222.202
hypervisor_hosts:
  hypervisor-01:
    ansible_host: nyctea

###
# The inventory will have one host identified as the bastion host. By default, this role will
# be assumed by the first hypervisor. The first example would set the second hypervisor to be the bastion host.
# The final example would set the bastion host to be a different host that is not
# being used as a hypervisor for the guests VMs for the clusters using this inventory.
#
# .. code-block:: yaml
#
#    bastion_host:
#        bastion:
#            ansible_host: hypervisor-02
#
# .. code-block:: yaml
#
#    bastion_host:
#        bastion:
#            ansible_host: bastion
bastion_host:
  bastion:
    target: hypervisor-01

###
# The inventory will have one host identified as the ovn-central host. By default, this role
# will be assumed by the first hypervisor. The example would set the second hypervisor
# to be the ovn-central host.
#
# .. code-block:: yaml
#
#    ovn_central_host:
#        ovn-central:
#            target: hypervisor-02
ovn_central_host:
  ovn-central:
    target: hypervisor-01

###
#
# Setup host definition (used only with the setup-playbook.yml)
#
#
# This inventory will have one host identified as the setup host. By default, this will be
# localhost. The first example would set the first hypervisor host to be the setup host.
# The last example would set the setup host to be a different host that is not being used
# as a hypervisor in this inventory.
#
# .. code-block:: yaml
#
#    setup_host:
#        kubeinit-setup:
#            ansible_host: nyctea
#
# or
#
# .. code-block:: yaml
#
#    setup_host:
#        kubeinit-setup:
#            ansible_host: 192.168.222.214
setup_host:
  kubeinit-setup:
    ansible_host: nyctea
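
While debugging this, one check that might narrow it down (my assumption being that the template trips over an inventory host without ansible_host set, e.g. the bastion and ovn-central entries above that only define target):

    # Print ansible_host for every inventory host; hosts reported as
    # "VARIABLE IS NOT DEFINED!" are candidates for the template failure.
    ansible -i ./kubeinit/inventory all -m debug -a "var=ansible_host"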

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days

A lot of things got broken because podman wasn't consistent across the components we deploy; this should be fixed by #666 now.
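
If it helps to verify that before retrying, a quick ad-hoc check along these lines (a sketch; the group name is taken from the inventory above):

    # Compare the podman version on each hypervisor so mismatches between
    # components are easy to spot.
    ansible -i ./kubeinit/inventory hypervisor_hosts -m command -a "podman --version"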

commented

Hi @ccamacho,

Any idea why I am still seeing the error below? I cannot proceed with my testing on the main branch because of it.
Is there anything wrong with my inventory.yml file?

(The logs and Inventory.yml are identical to those in my previous comment above.)

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days