Proposal: Link state detection via netlink in ifupdown
nmeum opened this issue
At Alpine, we use ifupdown-ng with BusyBox udhcpc by default. BusyBox udhcpc doesn't do any link state detection; it just sends multiple DHCP requests and times out if no response is received within a certain timeframe. Since no link state detection is performed by udhcpc itself, some of these requests may fail because the link is not up yet. This can cause udhcpc to run into a timeout without receiving a DHCP lease [1] [2] [3].
Instead of relying on a timeout mechanism in BusyBox udhcpc, it would be ideal if ifupdown-ng would only run the DHCP executor after the link came up. On Linux, it is possible to be notified about the link state via netlink(7). For example, refer to bncm-waitif from the bncm network manager. In ifupdown-ng this could be easily integrated by blocking after the pre-up phase until the interface is IFF_UP / IFF_RUNNING and only running the up phase after that's the case. I implemented this idea as a hacky ifupdown-ng executor as a PoC; however, I think it would be preferable to integrate this directly into ifupdown-ng.
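To make the IFF_UP / IFF_RUNNING check concrete, here is a minimal sketch of how a single RTM_NEWLINK message could be inspected. It is not the PoC executor's code. The function names `link_is_ready` and `build_newlink` are made up for illustration; the constants and the `nlmsghdr` / `ifinfomsg` layouts follow the Linux headers, and native byte order is assumed.

```python
import struct

# Constants from <linux/rtnetlink.h> and <linux/if.h>.
RTM_NEWLINK = 16
IFF_UP = 0x1
IFF_RUNNING = 0x40

# struct nlmsghdr: len, type, flags, seq, pid (16 bytes, native byte order).
NLMSGHDR = struct.Struct("=IHHII")
# struct ifinfomsg: family, pad byte, type, index, flags, change (16 bytes).
IFINFOMSG = struct.Struct("=BxHiII")


def link_is_ready(msg: bytes, ifindex: int) -> bool:
    """Hypothetical helper: True if msg is an RTM_NEWLINK message for
    ifindex with both IFF_UP and IFF_RUNNING set."""
    if len(msg) < NLMSGHDR.size + IFINFOMSG.size:
        return False
    _len, msg_type, _flags, _seq, _pid = NLMSGHDR.unpack_from(msg, 0)
    if msg_type != RTM_NEWLINK:
        return False
    _family, _type, index, flags, _change = IFINFOMSG.unpack_from(
        msg, NLMSGHDR.size
    )
    wanted = IFF_UP | IFF_RUNNING
    return index == ifindex and (flags & wanted) == wanted


def build_newlink(ifindex: int, flags: int) -> bytes:
    """Hypothetical helper that builds a bare RTM_NEWLINK message
    (no attributes), only used to exercise link_is_ready()."""
    body = IFINFOMSG.pack(0, 0, ifindex, flags, 0)
    header = NLMSGHDR.pack(NLMSGHDR.size + len(body), RTM_NEWLINK, 0, 0, 0)
    return header + body
```

In a real executor the messages would presumably be read from a NETLINK_ROUTE socket subscribed to link notifications, and one received buffer may contain several messages; the sketch only covers the flag check on a single message.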
Is there any interest in this? If so, what kind of implementation would be acceptable (directly in libifupdown or as an executor)? And: Would it be acceptable to add an (optional?) dependency on libmnl, for interacting with netlink?
I like the idea, let's do it.
I started working on this in #180. Apart from the small stuff like making timeouts etc. configurable, one problem for which I don't have an elegant solution yet is that the link executor is run in the up phase currently. However, my goal would be to only run executors of the up phase when the interface state changed to IFF_UP / IFF_RUNNING (as detected via netlink). As such, the link executor either needs to be moved to the pre-up phase or maybe a new phase needs to be added which is executed between pre-up and up?
Hm, I have mixed feelings about this, as it introduces significant risk. For example, imagine a power outage after which everything does a cold start. Switches can take quite a while to boot up fully (like minutes) and, especially in a DC environment, can take way longer to have their ports ready than servers. So if we bailed out on "meh no link" we would render the entire site offline, needing manual intervention. That must not happen.
So if we implement something like this, we need to make the user explicitly request that behavior. I was wondering whether we could derive whether to activate this from "DHCP requested" but I'd even say "no" to that, as IIRC dhclient will keep trying for quite a while (even forever?).
I guess my point is: This feels to me like a special case for udhcpc and more like we want to do allow hot-plug which ifupdown1/2 do, so the interface is configured by udev or something when it gets added/online.
Thanks a lot for your input! Below just some thoughts of mine on the issues you raised.
So if we bailed out on "meh no link" we would render the entire site offline needing manual intervention.
We don't have to bail out after a timeout. We can also block indefinitely by default until the link is up. I think a timeout mechanism is useful for desktop systems, but this should just be a matter of configuration. In the scenario you are describing, some DHCP clients would bail out after a timeout as well (or at least the DHCP executor with udhcpc does so by default). Also: If the link isn't up, there is IMHO no point in starting the DHCP client since it won't be able to acquire a lease anyhow. However, blocking indefinitely may be the saner default, and it would be entirely possible to implement that with libmnl.
IIRC dhclient will keep trying for quite a while (even forever?).
Not sure about dhclient but udhcpc will retry a few times and then bail out (see the -t, -T, and -A options).
Might also be worthwhile to align the behavior of different DHCP clients supported by the DHCP executor in this regard.
I guess my point is: This feels to me like a special case for udhcpc and more like we want to do allow hot-plug which ifupdown1/2 do, so the interface is configured by udev or something when it gets added/online.
I see where you are coming from, but I personally don't feel like this is a special case just for udhcpc. If we start a DHCP client before the link is up, there are two options regarding what the DHCP client can do: (a) it either needs to implement its own netlink link state detection (which dhcpcd does, for example) or (b) it needs to do some sort of polling and retry acquiring DHCP leases periodically (which is what udhcpc does). The latter is a frequent source of issues at Alpine, and in accordance with the Unix philosophy it seems wrong to implement the former in every DHCP client (or every up executor for that matter). As such, IMHO this should instead be handled by ifupdown-ng.
We don't have to bail out after a timeout. We can also block indefinitely by default until the link is up.
That results in the same problem: We will end up with systems which aren't reachable after a (re)boot and people complaining to us, and rightfully so.
My point about allow hot-plug was that this would solve the issue you are facing/describing without breaking anything :-) So I think we should rather focus on this so a user can configure this and request "Only configure this interface once it has a link".
I don't see any scenario in which we can either stop configuring interfaces when one doesn't have a link or wait indefinitely. The only way I could imagine adding such a feature would be if a user explicitly requested this behavior in the interfaces configuration. This cannot become any form of default.
I think the way forward is to have an executor which pauses the pipeline until such time that the link is ready.
It could be configurable that way.
I think the way forward is to have an executor which pauses the pipeline until such time that the link is ready.
It could be configurable that way.
Sure, that is what https://github.com/nmeum/ifupdown-ng-waitif does.
Users have to explicitly configure it, for example via:
iface wlan0
    waitif-timeout 30
    use waitif
    use dhcp
Under the assumption that the waitif executor is executed before the dhcp executor, the waitif executor will block in the up phase until the link is ready, and only then will ifupdown-ng execute the dhcp executor. One problem with this design though is that the dhcp executor will still be executed if waitif failed (e.g. because of a timeout) due to #179.
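The blocking behaviour with an optional timeout can be sketched roughly as follows. This is not the actual ifupdown-ng-waitif code: `wait_for_link` and its parameters are made up for illustration, the timeout is assumed to be in seconds, and polling is used only to keep the sketch self-contained (an executor built on netlink would wait for events instead).

```python
import time
from typing import Callable, Optional


def wait_for_link(
    is_ready: Callable[[], bool],
    timeout: Optional[float],
    poll_interval: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Hypothetical helper: block until is_ready() returns True.

    timeout=None blocks indefinitely; otherwise False is returned once
    `timeout` seconds have passed without the link becoming ready.
    """
    deadline = None if timeout is None else clock() + timeout
    while not is_ready():
        if deadline is not None and clock() >= deadline:
            return False
        sleep(poll_interval)
    return True
```

A caller would run the dhcp step only when `wait_for_link(...)` returns True, which is the behaviour that is currently not guaranteed because of #179.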
I suppose I could slightly clean up the code and have that executor ship with ifupdown-ng if libmnl support is enabled? We could then modify Alpine's setup-interface to use it by default for udhcpc-based interfaces.
This may be only indirectly connected to the DHCP subject, but there's some commonality.
Here's the imaginary situation. Let's say the switch port temporarily goes down, for whatever reason, say blocked by STP (unlikely, but follow me). Let's say the switch is the authoritative DHCP server and is improperly configured. Our switch port remains down for any length of time. In the meantime another device is plugged into the switch, issues a DHCP request, and the switch fumbles and grants that other device the IP address that our computer was previously using. Today ifupdown-ng would keep our computer's interface up. When our switch port comes back up, ifupdown-ng would remain passive and our computer would try to continue using the same IP address, resulting in a duplicate IP situation. Again, highly unlikely and the fault of a careless admin.
I agree with not making this the default action. But wouldn't it make sense to create an executor that will sense the carrier using netlink and whenever the carrier is lost follow the steps to cleanly bring our interface down, with the opposite happening when the carrier comes back up? It's either that or installing and using netplug, just another piece of unwanted software.
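As a rough illustration of that idea, the core of such an executor could be a small edge detector over carrier reports. The sketch below is hypothetical: `carrier_actions` and the "ifup" / "ifdown" action names are placeholders, and a real implementation would receive the carrier state from netlink and invoke ifupdown-ng rather than return strings.

```python
from typing import Iterable, List


def carrier_actions(events: Iterable[bool], initially_up: bool = False) -> List[str]:
    """Hypothetical helper: map a stream of carrier states (True = carrier
    present) to actions, acting only when the state actually changes."""
    actions = []
    carrier = initially_up
    for up in events:
        if up and not carrier:
            actions.append("ifup")      # carrier came back: configure interface
        elif not up and carrier:
            actions.append("ifdown")    # carrier lost: cleanly deconfigure
        carrier = up
    return actions
```

Repeated reports of the same state produce no action, so a flood of identical notifications would not trigger redundant reconfiguration.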