Proposal: Link state detection via netlink in ifupdown

nmeum opened this issue 2 years ago · comments

At Alpine, we use ifupdown-ng with BusyBox udhcpc by default. BusyBox udhcpc doesn't do any link state detection; it just sends multiple DHCP requests and times out if no response is received within a certain timeframe. Since udhcpc performs no link state detection itself, some of these requests may fail because the link is not up yet. This can cause udhcpc to run into a timeout without receiving a DHCP lease [1] [2] [3].

Instead of relying on a timeout mechanism in BusyBox udhcpc, it would be ideal if ifupdown-ng only ran the DHCP executor after the link has come up. On Linux, it is possible to be notified about link state changes via netlink(7). For an example, refer to bncm-waitif from the bncm network manager. In ifupdown-ng, this could easily be integrated by blocking after the pre-up phase until the interface is IFF_UP / IFF_RUNNING and only running the up phase once that is the case. I implemented this idea as a hacky ifupdown-ng executor as a PoC. However, I think it would be preferable to integrate this directly into ifupdown-ng.
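To make the idea concrete, here is a minimal sketch in C of waiting for a link via rtnetlink. It is not the code from the PoC executor or from bncm-waitif, and it uses a raw NETLINK_ROUTE socket instead of libmnl only to stay dependency-free. All function names (link_is_ready, open_link_monitor, current_flags, buffer_reports_ready, wait_for_link) are made up for illustration, and the sketch blocks without any timeout.

```c
#define _DEFAULT_SOURCE /* expose IFF_* and struct ifreq under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* The readiness condition from the proposal: IFF_UP and IFF_RUNNING. */
bool link_is_ready(unsigned int flags)
{
	return (flags & IFF_UP) && (flags & IFF_RUNNING);
}

/* Open a netlink socket subscribed to link notifications (RTMGRP_LINK). */
int open_link_monitor(void)
{
	struct sockaddr_nl addr;
	int fd = socket(AF_NETLINK, SOCK_RAW | SOCK_CLOEXEC, NETLINK_ROUTE);

	if (fd < 0)
		return -1;
	memset(&addr, 0, sizeof(addr));
	addr.nl_family = AF_NETLINK;
	addr.nl_groups = RTMGRP_LINK;
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Read the interface's current flags via SIOCGIFFLAGS; 0 on error. */
unsigned int current_flags(const char *ifname)
{
	struct ifreq ifr;
	unsigned int flags = 0;
	int fd = socket(AF_INET, SOCK_DGRAM | SOCK_CLOEXEC, 0);

	if (fd < 0)
		return 0;
	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	if (ioctl(fd, SIOCGIFFLAGS, &ifr) == 0)
		flags = (unsigned short)ifr.ifr_flags;
	close(fd);
	return flags;
}

/* True if any RTM_NEWLINK message in buf reports ifindex as ready. */
bool buffer_reports_ready(void *buf, int len, int ifindex)
{
	struct nlmsghdr *nh;

	for (nh = buf; NLMSG_OK(nh, len); nh = NLMSG_NEXT(nh, len)) {
		struct ifinfomsg *ifi = NLMSG_DATA(nh);

		if (nh->nlmsg_type != RTM_NEWLINK ||
		    nh->nlmsg_len < NLMSG_LENGTH(sizeof(*ifi)))
			continue;
		if (ifi->ifi_index == ifindex && link_is_ready(ifi->ifi_flags))
			return true;
	}
	return false;
}

/* Block until ifname is ready. Returns 0 on success, -1 on error. */
int wait_for_link(const char *ifname)
{
	union { struct nlmsghdr align; char raw[8192]; } buf;
	unsigned int ifindex;
	int fd, ret = -1;

	assert(ifname != NULL);
	ifindex = if_nametoindex(ifname);
	if (ifindex == 0 || (fd = open_link_monitor()) < 0)
		return -1;
	/* Subscribe first, then check the current state, so a transition
	 * that happens between the two steps is not missed. */
	if (link_is_ready(current_flags(ifname)))
		ret = 0;
	while (ret != 0) {
		ssize_t n = recv(fd, buf.raw, sizeof(buf.raw), 0);

		if (n < 0 && errno == EINTR)
			continue;
		if (n < 0)
			break; /* e.g. ENOBUFS; a real tool might re-check flags */
		if (buffer_reports_ready(buf.raw, (int)n, (int)ifindex))
			ret = 0;
	}
	close(fd);
	return ret;
}
```

A caller would run something like wait_for_link("eth0") between the pre-up and up phases and only continue once it returns 0.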

Is there any interest in this? If so, what kind of implementation would be acceptable (directly in libifupdown or as an executor)? And: Would it be acceptable to add an (optional?) dependency on libmnl, for interacting with netlink?

Ariadne Conill · Answer 1 · Wed Jun 01 2022 04:46:42 GMT+0800 (China Standard Time)

I like the idea, let's do it.

Sören Tempel · Answer 2 · Sat Jun 04 2022 18:25:05 GMT+0800 (China Standard Time)

I started working on this in #180. Apart from the small stuff like making timeouts etc. configurable, one problem for which I don't have an elegant solution yet is that the link executor is run in the up phase currently. However, my goal would be to only run executors of the up phase when the interface state changed to IFF_UP / IFF_RUNNING (as detected via netlink). As such, the link executor either needs to be moved to the pre-up phase or maybe a new phase needs to be added which is executed between pre-up and up?

Maximilian Wilhelm · Answer 3 · Sat Jun 04 2022 23:23:00 GMT+0800 (China Standard Time)

Hm, I have mixed feelings about this, as it introduces significant risk. For example, imagine a power outage after which everything does a cold start. Switches can take quite a while to boot up fully (minutes, even), and especially in a DC environment they can take far longer than servers to have their ports ready. So if we bailed out on "meh no link" we would render the entire site offline needing manual intervention. That must not happen.

So if we implement something like this, we need to make the user explicitly request that behavior. I was wondering whether we could derive whether to activate this from "DHCP requested" but I'd even say "no" to that, as IIRC dhclient will keep trying for quite a while (even forever?).

I guess my point is: This feels to me like a special case for udhcpc, and more like what we want is allow hot-plug, which ifupdown1/2 do, so that the interface is configured by udev or something when it gets added/comes online.

Sören Tempel · Answer 4 · Tue Jun 07 2022 23:11:45 GMT+0800 (China Standard Time)

Thanks a lot for your input! Below just some thoughts of mine on the issues you raised.

> So if we bailed out on "meh no link" we would render the entire site offline needing manual intervention.

We don't have to bail out after a timeout. We can also block indefinitely by default until the link is up. I think a timeout mechanism is useful for desktop systems, but this should just be a matter of configuration. In the scenario you are describing, some DHCP clients would bail out after a timeout as well (or at least the DHCP executor with udhcpc does so by default). Also: if the link isn't up, there is IMHO no point in starting the DHCP client, since it won't be able to acquire a lease anyhow. However, blocking indefinitely may be the saner default, and it would be entirely possible to implement that with libmnl.
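The two policies discussed here (give up after a configurable timeout, or block indefinitely) can share one wait primitive. The following is a minimal sketch in C using poll(2) on a file descriptor such as a netlink socket; it does not use libmnl and is not taken from any existing ifupdown-ng code. The helper names (deadline_after_ms, ms_until, wait_readable_until) are hypothetical, and a NULL deadline is the assumed way to express "wait forever".

```c
#define _DEFAULT_SOURCE /* for clock_gettime() under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <poll.h>
#include <stddef.h>
#include <time.h>

/* Absolute CLOCK_MONOTONIC deadline that lies ms milliseconds in the future. */
struct timespec deadline_after_ms(long ms)
{
	struct timespec ts;

	assert(ms >= 0);
	clock_gettime(CLOCK_MONOTONIC, &ts);
	ts.tv_sec += ms / 1000;
	ts.tv_nsec += (ms % 1000) * 1000000L;
	if (ts.tv_nsec >= 1000000000L) {
		ts.tv_sec++;
		ts.tv_nsec -= 1000000000L;
	}
	return ts;
}

/* Milliseconds left until deadline, clamped to [0, INT_MAX]. */
int ms_until(const struct timespec *deadline)
{
	struct timespec now;
	long long ms;

	clock_gettime(CLOCK_MONOTONIC, &now);
	ms = (long long)(deadline->tv_sec - now.tv_sec) * 1000
	   + (deadline->tv_nsec - now.tv_nsec) / 1000000;
	if (ms < 0)
		return 0;
	return ms > INT_MAX ? INT_MAX : (int)ms;
}

/*
 * Wait until fd has an event pending. A NULL deadline blocks indefinitely.
 * Returns 1 if poll() reported an event, 0 on timeout, -1 on error.
 */
int wait_readable_until(int fd, const struct timespec *deadline)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	for (;;) {
		int timeout = deadline ? ms_until(deadline) : -1;
		int n = poll(&pfd, 1, timeout);

		if (n > 0)
			return 1;
		if (n == 0)
			return 0;
		if (errno != EINTR)
			return -1;
		/* Interrupted by a signal: retry with the remaining time. */
	}
}
```

Because the deadline is absolute, a caller can compute it once and pass it to every wait in its receive loop, so link events for unrelated interfaces do not extend the total waiting time.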

> IIRC dhclient will keep trying for quite a while (even forever?).

Not sure about dhclient, but udhcpc will retry a few times and then bail out (see the -t, -T, and -A options).
It might also be worthwhile to align the behavior of the different DHCP clients supported by the DHCP executor in this regard.

> I guess my point is: This feels to me like a special case for udhcpc, and more like what we want is allow hot-plug, which ifupdown1/2 do, so that the interface is configured by udev or something when it gets added/comes online.

I see where you are coming from, but I personally don't feel like this is a special case just for udhcpc. If we start a DHCP client before the link is up, there are two options regarding what the DHCP client can do: (a) it either needs to implement its own netlink link state detection (which dhcpcd does, for example), or (b) it needs to do some sort of polling and retry acquiring DHCP leases periodically (which is what udhcpc does). The latter is a frequent source of issues at Alpine, and in accordance with the Unix philosophy it seems wrong to implement the former in every DHCP client (or every up executor, for that matter). As such, IMHO this should instead be handled by ifupdown-ng.

Maximilian Wilhelm · Answer 5 · Sun Jun 12 2022 07:20:38 GMT+0800 (China Standard Time)

> We don't have to bail out after a timeout. We can also block indefinitely by default until the link is up.

That results in the same problem: We will end up with systems which aren't reachable after a (re)boot and people complaining to us, and rightfully so.

My point about allow hot-plug was that this would solve the issue you are facing/describing without breaking anything :-) So I think we should rather focus on this so a user can configure this and request "Only configure this interface once it has a link".

I don't see any scenario in which we can either stop configuring interfaces when one doesn't have a link or wait indefinitely. The only way I could imagine adding such a feature would be if a user explicitly requested this behavior in the interfaces configuration. This cannot become any form of default.

Ariadne Conill · Answer 6 · Sun Jun 12 2022 10:55:58 GMT+0800 (China Standard Time)

I think the way forward is to have an executor which pauses the pipeline until such time that the link is ready.

It could be configurable that way.

Sören Tempel · Answer 7 · Sun Jun 12 2022 17:44:10 GMT+0800 (China Standard Time)

> I think the way forward is to have an executor which pauses the pipeline until such time that the link is ready.
>
> It could be configurable that way.

Sure, that is what https://github.com/nmeum/ifupdown-ng-waitif does.

Users have to explicitly configure it, for example via:

iface wlan0
	waitif-timeout 30
	use waitif
	use dhcp

Under the assumption that the waitif executor is executed before the dhcp executor, the waitif executor will block in the up phase until the link is ready, and only then will ifupdown-ng execute the dhcp executor. One problem with this design, though, is that the dhcp executor will still be executed if waitif failed (e.g. because of a timeout) due to #179.

I suppose I could slightly clean up the code and have that executor ship with ifupdown-ng if libmnl support is enabled? We could then modify Alpine's setup-interface to use it by default for udhcpc-based interfaces.

kpiq · Answer 8 · Fri Sep 09 2022 00:33:56 GMT+0800 (China Standard Time)

This may be only indirectly connected to the DHCP subject, but there's some commonality.

Here's the imaginary situation. Let's say the switch port temporarily goes down for whatever reason, say blocked by STP (unlikely, but follow me). Let's say the switch is the authoritative DHCP server and is improperly configured. Our switch port remains down for some length of time. In the meantime, another device is plugged into the switch and issues a DHCP request, and the switch fumbles and grants that other device the IP address that our computer was previously using. Today, ifupdown-ng would keep our computer's interface up. When our switch port comes back up, ifupdown-ng would remain passive and our computer would try to continue using the same IP address, resulting in a duplicate IP situation. Again, highly unlikely and the fault of a careless admin.

I agree with not making this the default action. But wouldn't it make sense to create an executor that will sense the carrier using netlink and whenever the carrier is lost follow the steps to cleanly bring our interface down, with the opposite happening when the carrier comes back up? It's either that or installing and using netplug, just another piece of unwanted software.
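The core of such a carrier-sensing executor would be deciding, for each link notification, whether carrier was lost or regained. Below is a minimal sketch of that decision in C. It assumes IFF_RUNNING is used as the carrier indicator, and the names carrier_event, classify_carrier and command_for are hypothetical, not part of ifupdown-ng or netplug. Mapping the events to ifdown and ifup is likewise only an illustration of the suggestion above. One caveat: if the down path also set the link administratively down, the kernel would presumably stop reporting carrier changes, so a real monitor would likely have to leave the link itself up.

```c
#define _DEFAULT_SOURCE /* expose IFF_* under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <net/if.h>

enum carrier_event { CARRIER_UNCHANGED, CARRIER_LOST, CARRIER_GAINED };

/* Compare the flags from two successive link notifications. */
enum carrier_event classify_carrier(unsigned int old_flags, unsigned int new_flags)
{
	int was_running = (old_flags & IFF_RUNNING) != 0;
	int is_running = (new_flags & IFF_RUNNING) != 0;

	if (was_running == is_running)
		return CARRIER_UNCHANGED;
	return is_running ? CARRIER_GAINED : CARRIER_LOST;
}

/* Command a hypothetical monitor would run for the event, or NULL for none. */
const char *command_for(enum carrier_event ev)
{
	switch (ev) {
	case CARRIER_LOST:
		return "ifdown";
	case CARRIER_GAINED:
		return "ifup";
	case CARRIER_UNCHANGED:
		return NULL;
	}
	assert(!"unknown carrier_event");
	return NULL;
}
```

A monitor would remember the last flags it saw for the interface, feed each RTM_NEWLINK notification through classify_carrier, and act only when the result is not CARRIER_UNCHANGED.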