Kubeinit / kubeinit

Ansible automation to have a KUBErnetes cluster INITialized as soon as possible...

Home page: https://www.kubeinit.org


openvswitch and routes not working after restart on Rocky-Linux 8.5

Tokix opened this issue · comments

commented

Description
With a bit of modification of the kubeinit files I am able to get okd deployed on rocky-linux 8.5. You can see the modifications here: https://github.com/Tokix/kubeinit I could make a pull request but there is one thing that is not working as expected and that is the restart of the server. After the restart the routes are vanished and I'm not able to reach the frontend anymore.

To Reproduce
Steps to reproduce the behavior:

  1. Install a Rocky Linux 8.5 machine and set up the ssh connection as nyctea as described in the manual
  2. In my case I additionally had to install Python on the hypervisor_host machine before the playbook would run successfully:

yum install python3

  3. Clone the changes for Rocky 8.5:

git clone https://github.com/Tokix/kubeinit.git

  4. Run the playbook:
ansible-playbook \
    -v --user root \
    -e kubeinit_spec=okd-libvirt-3-1-1 \
    -i ./kubeinit/inventory \
    ./kubeinit/playbook.yml

  5. Enable the frontend:
ssh root@nyctea
chmod +x create-external-ingress.sh
./create-external-ingress.sh
  6. Set up the DNS entries for your system
  7. Check if the URL is working (it works at this point):

https://console-openshift-console.apps.okdcluster.kubeinit.local/

  8. Reboot the server:

init 6

  9. The URL is no longer working:

https://console-openshift-console.apps.okdcluster.kubeinit.local/

Expected behavior
The external URL of the cluster should be available after a restart, and the routes should be set.

Screenshots
Working route-configuration before the restart:

[screenshot]

Route configuration after restart:

[screenshot]

Infrastructure

  • Hypervisor OS: Rocky Linux
  • Version: 8.5

Deployment command

ansible-playbook \
    -v --user root \
    -e kubeinit_spec=okd-libvirt-3-1-1 \
    -i ./kubeinit/inventory \
    ./kubeinit/playbook.yml

Inventory file diff

I made no changes to the inventory file.

Additional context

As SELinux is active on Rocky Linux 8.5, my first thought was that some changes could not be persisted, so I disabled SELinux for testing. However, it still does not work after a restart.

I checked this old issue https://forums.opensuse.org/showthread.php/530879-openvswitch-loses-configuration-on-reboot but the boot order of openvswitch and network.service seems to be fine.

Furthermore, I re-ran the steps under "Attach our cluster network to the logical router" in kubeinit/roles/kubeinit_libvirt/tasks/create_network.yml. This got me back to the correct routing table, but I am still not able to reach the guest systems via 10.0.0.1-x.
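
Since re-running those tasks restores the routing table, one stopgap (a sketch, not part of kubeinit) is a small helper that re-applies the route after every reboot. The 10.0.0.0/24 subnet comes from the issue; the gateway address 172.16.0.1 and the device name br-ex are placeholder assumptions, not kubeinit's actual values:

```shell
# Re-apply the cluster route that the "Attach our cluster network to the
# logical router" tasks create. Gateway and device are assumptions.
apply_cluster_route() {
    run="$1"    # pass "echo" to preview the command, "" to execute for real
    $run ip route replace 10.0.0.0/24 via 172.16.0.1 dev br-ex
}

# Preview what would run (as root on the hypervisor, call: apply_cluster_route "")
apply_cluster_route echo
```

The real gateway and device can be read from the working routing table before the reboot (`ip route show`) and substituted in.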

Is there any script or service that needs to be, or can be, re-run to enable the networking after a reboot?
I am thankful for any hints; let me know if you need more information.
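
If no such service exists yet, one option would be a systemd oneshot unit ordered after openvswitch and libvirt that re-runs the route setup at boot. Everything below is an assumption (unit name, script path); the file is written to the current directory for illustration, whereas on the hypervisor it would go to /etc/systemd/system/:

```shell
# Sketch of a boot-time unit that re-applies the cluster routes.
# /usr/local/bin/kubeinit-restore-routes.sh is a hypothetical script that
# would contain the route commands from create_network.yml.
cat > kubeinit-restore-routes.service <<'EOF'
[Unit]
Description=Re-apply KubeInit cluster routes after boot (sketch)
After=openvswitch.service libvirtd.service network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/kubeinit-restore-routes.sh

[Install]
WantedBy=multi-user.target
EOF
```

If this pans out, it would be installed with `sudo cp kubeinit-restore-routes.service /etc/systemd/system/`, then `sudo systemctl daemon-reload && sudo systemctl enable kubeinit-restore-routes.service`.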

Thank you in any case for the great project :)

I'm also running into a problem with Rocky Linux.

Any help is welcome, this is a cool project, I hope we can get it working on Rocky.

TASK [kubeinit.kubeinit.kubeinit_prepare : Create ssh config file from template] *******************************************************************************
task path: /home/jeff/.ansible/collections/ansible_collections/kubeinit/kubeinit/roles/kubeinit_prepare/tasks/create_host_ssh_config.yml:52
<127.0.0.1> ESTABLISH LOCAL CONNECTION FOR USER: jeff
<127.0.0.1> EXEC /bin/sh -c 'echo ~jeff && sleep 0'
<127.0.0.1> EXEC /bin/sh -c '( umask 77 && mkdir -p "` echo /home/jeff/.ansible/tmp `"&& mkdir "` echo /home/jeff/.ansible/tmp/ansible-tmp-1650765476.136161-198434-213715344205742 `" && echo ansible-tmp-1650765476.136161-198434-213715344205742="` echo /home/jeff/.ansible/tmp/ansible-tmp-1650765476.136161-198434-213715344205742 `" ) && sleep 0'
<127.0.0.1> EXEC /bin/sh -c 'rm -f -r /home/jeff/.ansible/tmp/ansible-tmp-1650765476.136161-198434-213715344205742/ > /dev/null 2>&1 && sleep 0'
The full traceback is:
Traceback (most recent call last):
  File "/home/jeff/kubeinit/kubeinit/lib64/python3.6/site-packages/ansible/template/__init__.py", line 1117, in do_template
    res = j2_concat(rf)
  File "<template>", line 47, in root
  File "/home/jeff/kubeinit/kubeinit/lib64/python3.6/site-packages/jinja2/runtime.py", line 903, in _fail_with_undefined_error
    raise self._undefined_exception(self._undefined_message)
jinja2.exceptions.UndefinedError: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_host'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jeff/kubeinit/kubeinit/lib64/python3.6/site-packages/ansible/plugins/action/template.py", line 146, in run
    resultant = templar.do_template(template_data, preserve_trailing_newlines=True, escape_backslashes=False)
  File "/home/jeff/kubeinit/kubeinit/lib64/python3.6/site-packages/ansible/template/__init__.py", line 1154, in do_template
    raise AnsibleUndefinedVariable(e)
ansible.errors.AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_host'
fatal: [localhost]: FAILED! => {
    "changed": false,
    "msg": "AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_host'"
}

PLAY RECAP *****************************************************************************************************************************************************
localhost                  : ok=48   changed=7    unreachable=0    failed=1    skipped=25   rescued=0    ignored=0   
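
The `AnsibleUndefinedVariable` above usually means some host in the inventory never got `ansible_host` set when the ssh config template was rendered. A small filter like this (a sketch, not part of kubeinit) lists the offending hosts from `ansible-inventory --list` JSON output:

```shell
# Print every inventory host whose hostvars lack ansible_host.
# Reads the JSON that `ansible-inventory --list` emits on stdin.
hosts_missing_ansible_host() {
    python3 -c 'import json, sys
hv = json.load(sys.stdin)["_meta"]["hostvars"]
for host, vars_ in sorted(hv.items()):
    if "ansible_host" not in vars_:
        print(host)'
}

# Example with a tiny inline inventory dump (real usage would pipe in
# `ansible-inventory -i ./kubeinit/inventory --list`):
echo '{"_meta":{"hostvars":{"ok":{"ansible_host":"10.0.0.1"},"bad":{}}}}' \
  | hosts_missing_ansible_host
```

On the failing machine, piping the actual inventory dump through the function should show which host definition is incomplete.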

My issue isn't specific to Rocky, so I'll add a new issue.

I ran into the same error using Debian.

Edit (Issue added): #647

Maybe some iptables rules are not persisted after rebooting, and I don't have a way to test this on Rocky.
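
One way to test that hypothesis on any affected host is to diff the live rules against what is expected. The helper below (a sketch; the rule patterns are assumptions based on the 10.0.0.0/24 guest network, not kubeinit's actual rule set) reports which expected fragments are absent from an `iptables-save` dump:

```shell
# Report which expected rule fragments are missing from an iptables-save dump.
missing_rules() {
    dump="$1"
    for pat in "10.0.0.0/24" "MASQUERADE"; do
        case "$dump" in
            *"$pat"*) : ;;                   # fragment present
            *) echo "missing: $pat" ;;       # fragment absent
        esac
    done
}

# Example with a dump that covers the guest subnet but never masquerades it:
missing_rules "-A POSTROUTING -s 10.0.0.0/24 -j ACCEPT"
```

On the hypervisor it would be run as `missing_rules "$(sudo iptables-save)"` before and after the reboot; if rules do disappear, on RHEL-family hosts they are typically persisted via the iptables-services package (`iptables-save > /etc/sysconfig/iptables`).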

commented

Hi @ccamacho,

Thanks for the awesome project. 👍

I am also running into the same issue. After a reboot, I am not able to reach 10.0.0.x.
Is there a way to re-enable the networking after a reboot?

commented

I've got two servers, one with Alma 8.x (which also seems to lose connectivity after a reboot) and one with CentOS Stream. I could help by providing some debug data; I can sacrifice my currently running clusters if need be.

commented

Okay, so the one with CentOS 8 and vanilla k8s didn't persist after restart. The VMs launched fine, but there was no networking. Also, the service pod only had one IP address, from the 10.89.x.x subnet.