canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization

Home Page:https://cloud-init.io/

[Azure]cloud-init's use of HWADDR in ifcfg-eth0 causes systemd-udev error changing net interface name for Accelerated Network in Azure

zhaohuijuan opened this issue

Description of problem:
Accelerated Networking in Azure presents an hv_netvsc adapter and an mlx VF adapter with the same MAC address. cloud-init on Azure writes an ifcfg file for the hv_netvsc device that includes HWADDR. When systemd-udev runs, it generates a rename error because it tries to give the mlx device the same name that is assigned to the hv_netvsc device.

The goal of this bug is to find out what is needed to make systemd-udevd accommodate Accelerated Networking in Azure, where the hv_netvsc and mlx interfaces have the same MAC address. So far I cannot see a solution that stops systemd-udev from trying to name the mlx adapter based on the HWADDR in ifcfg-*, even when there are MATCH options to exclude the mlx adapter.
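
One way to confirm that a VM is in the state described above is to check which interfaces report the same MAC address. The following is a minimal bash sketch and not part of the original report: `find_shared_macs` is a hypothetical helper name, and it assumes the usual sysfs layout in which each interface exposes its MAC in /sys/class/net/<dev>/address. The optional directory argument exists only so the function can be pointed at a test tree.

```shell
# Print every MAC address that is reported by more than one interface,
# followed by the names of the interfaces that share it.
# find_shared_macs is a hypothetical helper; $1 (sysfs root) is an assumption
# of this sketch and defaults to /sys/class/net.
find_shared_macs() {
    local root="${1:-/sys/class/net}"
    local dev
    for dev in "$root"/*; do
        # Skip entries that are not interfaces (no readable address file).
        [ -r "$dev/address" ] || continue
        printf '%s %s\n' "$(cat "$dev/address")" "$(basename "$dev")"
    done | LC_ALL=C sort | awk '
        { names[$1] = (names[$1] == "" ? $2 : names[$1] " " $2); count[$1]++ }
        END { for (mac in count) if (count[mac] > 1) print mac ": " names[mac] }
    ' | LC_ALL=C sort
}
```

On a VM with Accelerated Networking enabled, calling `find_shared_macs` with no arguments would be expected to list the hv_netvsc interface and the mlx VF on the same line.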

Version-Release number of selected component (if applicable):
RHEL-8.8/8.9/9.2/9.3
cloud-init-22.1
cloud-init-23.1.1

How reproducible:
Create an Azure VM with an accelerated network adapter and reboot it after the install, then look for errors like:

Jul 17 21:55:21 novo-rhel8-new systemd-udevd[623]: Error changing net interface name 'eth0' to 'net0': File exists

Steps to Reproduce:

Create an Azure VM with:

"imageReference": {
  "communityGalleryImageId": null,
  "exactVersion": "8.7.2023022801",
  "id": null,
  "offer": "RHEL",
  "publisher": "RedHat",
  "sharedGalleryImageId": null,
  "sku": "87-gen2",
  "version": "latest"
},

VM size: Standard_D2s_v3

$ az network nic show -g (username's-rg) --name rhel8-test779 --query "enableAcceleratedNetworking"
true

Deploy then reboot after the deployment so that the system boots with the ifcfg-eth0 added by cloud-init.

Actual results:

RHEL VM should have:

  - accelerated networking
  - the hv_netvsc device named eth0
  - the mlx device named enP*
  - no IP address assigned to the mlx device
  - the mlx and hv_netvsc devices properly linked (master eth0 for the enP* device)

e.g.

$ ip a
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 60:45:bd:eb:12:ad brd ff:ff:ff:ff:ff:ff
inet 10.0.0.4/24 brd 10.0.0.255 scope global noprefixroute eth0
valid_lft forever preferred_lft forever
inet6 fe80::6245:bdff:feeb:12ad/64 scope link
valid_lft forever preferred_lft forever
3: enP28653s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth0 state UP group default qlen 1000
link/ether 60:45:bd:eb:12:ad brd ff:ff:ff:ff:ff:ff
altname enP28653p0s2

$ ethtool -i eth0 | head -2
driver: hv_netvsc
version: 4.18.0-425.13.1.el8_7.x86_64

After the reboot, /var/log/messages contains:

Jul 15 15:04:04 test systemd-udevd[823]: Error changing net interface name 'enP65246s1' to 'eth0': File exists
Jul 15 15:04:04 test systemd-udevd[823]: could not rename interface '3' from 'enP65246s1' to 'eth0': File exists
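
To pull only the failed renames out of a busy log, the messages above can be reduced to "old name -> requested name" pairs. This is a rough bash sketch under the assumption that the log lines have exactly the format shown in this report; `rename_failures` is a hypothetical helper name, not an existing tool.

```shell
# Read syslog lines on stdin and print "old -> new" for every
# systemd-udevd interface rename that failed with "File exists".
# rename_failures is a hypothetical helper; the pattern is based only on
# the message format quoted in this report.
rename_failures() {
    sed -n "s/.*systemd-udevd\[[0-9]*]: Error changing net interface name '\([^']*\)' to '\([^']*\)': File exists.*/\1 -> \2/p"
}
```

For example, `rename_failures < /var/log/messages` would print one line per failed rename, which makes it easy to see which device udev tried to rename and to what.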

WHAT CAN BE ADDED TO THE CONFIGURATION TO STOP systemd-udev FROM TRYING TO NAME THE mlx ADAPTER BASED ON HWADDR IN /etc/sysconfig/network-scripts/ifcfg-*?

Expected results:

No error.

Additional info:

I also tried adding the following to the ifcfg file, while moving away from the eth0 name for the hv_netvsc device:

MATCH_PATH='!-pci'
MATCH_DRIVER="hv_netvsc"

Here is how I did that since the Azure instance's cloud-init overwrites any changes to ifcfg-eth0:

After cloud-init had run and put its ifcfg-eth0 in place, I did the following:

  1. rpm -e cloud-init
  2. dracut -f -o "network network-manager ifcfg"
  3. Renamed eth0 to net0, to avoid conflicts caused by the use of eth* naming.
  4. Added the MATCH_PATH and MATCH_DRIVER options to ifcfg-net0.

systemd-udev still produces the rename error:

Jul 17 21:55:21 rhel8-new systemd-udevd[623]: PROGRAM '/lib/udev/rename_device' /usr/lib/udev/rules.d/60-net.rules:1
Jul 17 21:55:21 rhel8-new systemd-udevd[623]: '/lib/udev/rename_device'(out) 'net0'
Jul 17 21:55:21 rhel8-new systemd-udevd[623]: Process '/lib/udev/rename_device' succeeded.
Jul 17 21:55:21 rhel8-new systemd-udevd[623]: NAME 'net0' /usr/lib/udev/rules.d/60-net.rules:1
Jul 17 21:55:21 rhel8-new systemd-udevd[623]: IMPORT builtin 'net_id' /usr/lib/udev/rules.d/75-net-description.rules:6
Jul 17 21:55:21 rhel8-new systemd-udevd[623]: Using default interface naming scheme 'rhel-8.0'.
Jul 17 21:55:21 rhel8-new systemd-udevd[623]: IMPORT builtin 'hwdb' /usr/lib/udev/rules.d/75-net-description.rules:12
Jul 17 21:55:21 rhel8-new systemd-udevd[623]: IMPORT builtin 'path_id' /usr/lib/udev/rules.d/80-net-setup-link.rules:5
Jul 17 21:55:21 rhel8-new systemd-udevd[623]: IMPORT builtin 'net_setup_link' /usr/lib/udev/rules.d/80-net-setup-link.rules:9
Jul 17 21:55:21 rhel8-new systemd-udevd[623]: Config file /usr/lib/systemd/network/99-default.link applies to device eth0
Jul 17 21:55:21 rhel8-new systemd-udevd[623]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Jul 17 21:55:21 rhel8-new systemd-udevd[623]: RUN '/usr/lib/systemd/systemd-sysctl --prefix=/net/ipv4/conf/$name --prefix=/net/ipv4/neigh/$name --prefix=/net/ipv6/conf/$name --prefix=/net/ipv6/neigh/$name' /usr/lib/udev/rules.d/99-systemd.rules:60
Jul 17 21:55:21 rhel8-new systemd-udevd[623]: Error changing net interface name 'eth0' to 'net0': File exists
Jul 17 21:55:21 rhel8-new systemd-udevd[623]: could not rename interface '3' from 'eth0' to 'net0': File exists
Jul 17 21:55:21 rhel8-new systemd-udevd[623]: Process '/usr/lib/systemd/systemd-sysctl --prefix=/net/ipv4/conf/net0 --prefix=/net/ipv4/neigh/net0 --prefix=/net/ipv6/conf/net0 --prefix=/net/ipv6/neigh/net0' succeeded.

$ cat /etc/sysconfig/network-scripts/ifcfg-net0
BOOTPROTO=dhcp
DEVICE=net0
HWADDR=00:0d:3a:9b:bb:95
ONBOOT=yes
TYPE=Ethernet
USERCTL=no
MATCH_PATH='!-pci'
MATCH_DRIVER="hv_netvsc"
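
Since the rename is driven by the HWADDR entries in ifcfg-*, it can help to list which ifcfg files pin a MAC address at all. The following is a small bash sketch, assuming the ifcfg key=value format shown above; `list_ifcfg_hwaddr` is a hypothetical helper name, and the directory argument is an assumption added so the function can be run against a copy of the files.

```shell
# Print "<file> <mac>" for every ifcfg-* file that sets HWADDR.
# list_ifcfg_hwaddr is a hypothetical helper; $1 defaults to the
# network-scripts directory named in this report.
list_ifcfg_hwaddr() {
    local dir="${1:-/etc/sysconfig/network-scripts}"
    local f mac
    for f in "$dir"/ifcfg-*; do
        [ -f "$f" ] || continue
        # Take the value after HWADDR=, drop optional quotes, lowercase hex digits.
        mac=$(sed -n 's/^HWADDR=//p' "$f" | tr -d "\"'" | tr 'A-F' 'a-f')
        if [ -n "$mac" ]; then
            printf '%s %s\n' "$(basename "$f")" "$mac"
        fi
    done
}
```

Comparing this output with the MAC addresses of the hv_netvsc and mlx devices shows which ifcfg file is the source of the name that udev tries to apply.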

$ nmcli con show
NAME         UUID                                  TYPE      DEVICE
System net0  33e2dfe5-649b-21d9-4fc5-2feaad011ccc  ethernet  net0

$ nmcli con show 33e2dfe5-649b-21d9-4fc5-2feaad011ccc | grep match
match.interface-name: --
match.kernel-command-line: --
match.driver: hv_netvsc
match.path: !-pci

$ ip a show
2: net0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:0d:3a:9b:bb:95 brd ff:ff:ff:ff:ff:ff
inet 10.0.0.5/24 brd 10.0.0.255 scope global noprefixroute net0
valid_lft forever preferred_lft forever
inet6 fe80::20d:3aff:fe9b:bb95/64 scope link
valid_lft forever preferred_lft forever
3: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master net0 state UP group default qlen 1000
link/ether 00:0d:3a:9b:bb:95 brd ff:ff:ff:ff:ff:ff
altname enP29499p0s2

$ ethtool -i net0 | grep driver
driver: hv_netvsc
$ ethtool -i eth0 | grep driver
driver: mlx5_core

$ sudo udevadm info --export-db | grep -i -e driver -e "DEVPATH.*eth0" -e "DEVPATH.*net0" -e "ID_PATH" | grep -A 2 DEVPATH
calling: info
E: DEVPATH=/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/000d3a9b-bb95-000d-3a9b-bb95000d3a9b/net/net0
E: ID_NET_DRIVER=hv_netvsc
E: ID_PATH=acpi-VMBUS:00

E: DEVPATH=/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/7cb93933-733b-443c-932a-9473684376c8/pci733b:00/733b:00:02.0/net/eth0
E: ID_NET_DRIVER=mlx5_core
E: ID_PATH=acpi-VMBUS:00-pci-733b:00:02.0
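
The udev properties above can be condensed into one line per device to show which interface has an ID_PATH containing "-pci", the substring that MATCH_PATH='!-pci' is meant to exclude. This is a bash sketch only: `summarize_net_devices` is a hypothetical helper name, and it assumes the input has already been filtered into the DEVPATH, ID_NET_DRIVER, ID_PATH order shown in the output above.

```shell
# Read filtered "udevadm info --export-db" output on stdin and print
# "<interface> <driver> <pci|non-pci>" for each network device.
# summarize_net_devices is a hypothetical helper; it relies on the
# DEVPATH / ID_NET_DRIVER / ID_PATH ordering seen in this report.
summarize_net_devices() {
    awk '
        /^E: DEVPATH=.*\/net\// { n = split($0, parts, "/"); name = parts[n] }
        /^E: ID_NET_DRIVER=/    { sub(/^E: ID_NET_DRIVER=/, ""); driver = $0 }
        /^E: ID_PATH=/ {
            sub(/^E: ID_PATH=/, "")
            kind = ($0 ~ /-pci/) ? "pci" : "non-pci"
            print name, driver, kind
        }
    '
}
```

Piping the grep pipeline above into `summarize_net_devices` gives a compact view of which driver and path type each interface name currently belongs to.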

@zhaohuijuan , there have been a few changes in the past year or so around dealing with this adapter on Azure.
E.g.,
#1914
#2153

You mentioned some older versions of cloud-init. Is there any way for you to update to the latest version of cloud-init and see if that works any differently?

Thanks @TheRealFalcon for the quick response.

I updated to the latest version available in RHEL, 23.4, and it also has this issue.

$ cat /var/log/messages | grep -i 'File exists'
Jan 12 07:51:27 huzhao81001120409-vm1 systemd-udevd[878]: Error changing net interface name 'enP44943s1' to 'eth0': File exists
Jan 12 07:51:27 huzhao81001120409-vm1 systemd-udevd[878]: could not rename interface '3' from 'enP44943s1' to 'eth0': File exists

@zhaohuijuan , thanks for the response. I'll admit that I'm not very familiar with RHEL systems. Can you provide more detailed reproduction steps? How exactly did you launch the RHEL cloud instance and install the latest cloud-init?

Additionally, you may have better luck submitting this as a distro bug to RHEL. While this very well may be an issue that originates in and can be fixed by cloud-init, since it involves interaction with the networking subsystem and udev, along with possible kernel races, there are likely distro-specific interactions that need to be considered. I don't say that to punt the issue; if I can get detailed reproduction steps, I will continue to investigate. It's just that as an upstream, we're not as well equipped to understand and debug those sorts of interactions.

@TheRealFalcon Thanks, I totally understand. We also reported a RHEL bug to track this issue: https://issues.redhat.com/browse/RHEL-7285, and Azure and Red Hat engineers are working on it as well. But as you mentioned, this issue may involve interaction between the networking subsystem, udev, and cloud-init, so I reported it here so that the cloud-init and Microsoft communities can work together on a fix.

Here are the reproduction steps:

  1. Deploy an instance on Azure with a RHEL image (with the cloud-init package pre-installed) and set the network to SR-IOV (Accelerated Networking).
  2. Log in to the instance; there are no errors.
  3. Reboot the instance and check the log; there are errors like the ones below:
    $ cat /var/log/messages | grep -i 'File exists'
    Dec 22 08:20:11 huzhao81012220808-vm1 systemd-udevd[821]: Error changing net interface name 'enP29710s1' to 'eth0': File exists
    Dec 22 08:20:11 huzhao81012220808-vm1 systemd-udevd[821]: could not rename interface '3' from 'enP29710s1' to 'eth0': File exists

Setting this to invalid/incomplete and closing for now. I don't have access to the RHEL bug so I can't see progress there, and additionally from the existing information in this report it doesn't appear that cloud-init is actually doing anything incorrect.

Please report back with more information if you believe that there is work to be done on the cloud-init side of this issue.

@holmanb The RHEL bug [1] is public now; you can access it for more information if needed. Thanks!
[1] https://issues.redhat.com/browse/RHEL-7285

@holmanb The RHEL bug is public now, could we re-open this issue for more debugging? Thanks!