oxidecomputer / omicron

Omicron: Oxide control plane

nexus zone failed to come up due to uninitialized underlay that appears to have been initialized?

rcgoodfellow opened this issue

In an a4x2 run, I had a cluster come all the way up, except for a single nexus zone. Attached is the log from the sled agent responsible for launching the missing nexus zone.

The reason for the zone not coming up is sled-agent observing an uninitialized xde underlay state.

22:04:26.535Z WARN SledAgent (ServiceManager): Zone failed to start
    file = sled-agent/src/services.rs:2998
    zone = oxz_nexus_7ed9c5a4-18c2-4704-bd61-a1c12ffdbfe0
22:04:26.536Z INFO SledAgent (dropshot (SledAgent)): request completed
    error_message_external = Internal Server Error
    error_message_internal = Failed to initialize zones: [("oxz_nexus_7ed9c5a4-18c2-4704-bd61-a1c12ffdbfe0", ServicePortCreation { service: "nexus", err: Opte(CommandError(CreateXde, System { errno: 22, msg: "underlay not initialized" })) })]

However, earlier in the log file, sled agent appears to initialize the underlay.

2024-03-22 21:35:23.933Z INFO SledAgent/1376 on g3: using '[AddrObject { interface: "vioif1", name: "ll" }, AddrObject { interface: "vioif2", name: "ll" }]' as data links for xde driver
    file = illumos-utils/src/opte/illumos.rs:90
    sled_id = 79f20df6-259c-4167-90ae-d840d6d84041

sled-agent-no-nexus.log

Re-launching the environment from scratch did not reproduce the issue, so it is non-deterministic.

My guess here is that the xde module got unloaded between the two events. I see this quite often on the bench with DEBUG OS bits, since there is a periodic module reaper.

                /*
                 * In DEBUG kernels, unheld drivers are uninstalled periodically
                 * every mod_uninstall_interval seconds.  Periodic uninstall can
                 * be disabled by setting mod_uninstall_interval to 0 which is
                 * the default for a non-DEBUG kernel.
                 */

Perhaps xde should refuse to unload once the underlay is first initialised?
It's harder to see how this could happen once other zones using opte are up though.

Quick test to confirm that the module can be unloaded after the underlay is initialised:

gimlet-sn06 # /opt/oxide/opte/bin/opteadm set-xde-underlay cxgbe{0,1}
gimlet-sn06 # /opt/oxide/opte/bin/opteadm set-xde-underlay cxgbe{0,1}
Error: command SetXdeUnderlay failed: System { errno: 17, msg: "underlay already initialized" }
gimlet-sn06 # modunload -i 0
gimlet-sn06 # /opt/oxide/opte/bin/opteadm set-xde-underlay cxgbe{0,1}
gimlet-sn06 #

I think if we do so, we would also need to add an ioctl to clear the underlay, if only to allow for easier local testing. AFAIK we remove the underlay by unloading XDE today.

Would we expect this behavior on non-DEBUG kernels? I don't believe I'm running debug bits and this just torpedoed another test run.

I think it's possible modules can be unloaded for reasons other than the DEBUG bits' enthusiasm for doing it regularly. If memory is tight, for example, I'm pretty sure one of the things we might try in order to free some up is unloading things that are willing to unload.

Conceivably it could also be some of our own software doing it; e.g.,

function unload_xde_driver {

> I think if we do so, we would also need to add an ioctl to clear the underlay, if only to allow for easier local testing. AFAIK we remove the underlay by unloading XDE today.

I think this is probably critical to do; unloading can happen any time. If your module has discretionary state set up that you don't want to lose (such as the operator having configured the underlay), you need to prevent yourself detaching (if you're something that has instances) or unloading (if you are not). In the case that you don't have a driver instance, I believe you can take a hold on yourself to essentially make yourself busy -- and then release that hold after someone unconfigures the underlay etc.
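
To make the proposed behaviour concrete, here is a minimal sketch of the state machine being discussed: unload is refused while the underlay is configured, and an explicit clear operation releases it. Everything here is hypothetical (`XdeModel` and its methods are illustrative names, not the xde driver or any illumos interface). The errno values 17 and 22 match the messages quoted earlier in this thread; using EBUSY (16) for a refused unload is an assumption.

```rust
// Hypothetical model of the proposal; not the real xde driver or an illumos API.

pub const EBUSY: i32 = 16; // assumed errno for a refused unload
pub const EEXIST: i32 = 17; // "underlay already initialized" (from the opteadm output)
pub const EINVAL: i32 = 22; // "underlay not initialized" (from the sled-agent log)

/// Toy stand-in for the driver's global state.
#[derive(Debug, Default)]
pub struct XdeModel {
    /// The two underlay data links, once configured.
    underlay: Option<(String, String)>,
}

impl XdeModel {
    /// Configure the underlay; this is the point where the module would
    /// take a hold on itself.
    pub fn set_underlay(&mut self, u1: &str, u2: &str) -> Result<(), i32> {
        if self.underlay.is_some() {
            return Err(EEXIST);
        }
        self.underlay = Some((u1.to_string(), u2.to_string()));
        Ok(())
    }

    /// The proposed "clear underlay" operation; releases the hold.
    pub fn clear_underlay(&mut self) -> Result<(), i32> {
        match self.underlay.take() {
            Some(_) => Ok(()),
            None => Err(EINVAL),
        }
    }

    /// Creating a port needs the underlay, as in the failing nexus zone.
    pub fn create_port(&self, _name: &str) -> Result<(), i32> {
        if self.underlay.is_some() {
            Ok(())
        } else {
            Err(EINVAL)
        }
    }

    /// Unload is refused while the underlay is configured.
    pub fn try_unload(&self) -> Result<(), i32> {
        if self.underlay.is_some() {
            Err(EBUSY)
        } else {
            Ok(())
        }
    }
}
```

With this model, the sequence from the quick test above changes: after `set_underlay`, `try_unload` fails until `clear_underlay` is called, so a later `create_port` cannot observe an underlay that silently went away.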

> I think it's possible modules can be unloaded for reasons other than the DEBUG bits' enthusiasm for doing it regularly. If memory is tight, for example, I'm pretty sure one of the things we might try in order to free some up is unloading things that are willing to unload.

Do you know if there are any counters for this (or some other easy way to tell if it's happened)?

You can see how many unloads there have been in general with

kstat -p cpu::sys:modunload

but not anything that would tell you it was xde.
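
If it helps anyone chasing this, a rough way to use that counter is to snapshot the `kstat -p` output before and after a test run and compare the totals. The sketch below assumes the usual `kstat -p` line format (`module:instance:name:statistic`, whitespace, value); the helper names are made up for illustration. It only tells you that some module was unloaded in the interval, not that it was xde.

```rust
/// Sum the per-CPU `modunload` counters found in `kstat -p` output.
/// Lines that do not look like `...:modunload <number>` are ignored.
pub fn total_modunloads(kstat_output: &str) -> u64 {
    kstat_output
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            let key = fields.next()?;
            let value = fields.next()?;
            if key.ends_with(":modunload") {
                value.parse::<u64>().ok()
            } else {
                None
            }
        })
        .sum()
}

/// Number of module unloads that happened between two snapshots.
/// Saturates at zero if the counters appear to have gone backwards.
pub fn unloads_between(before: &str, after: &str) -> u64 {
    total_modunloads(after).saturating_sub(total_modunloads(before))
}
```

A non-zero result from `unloads_between` around the time the underlay "disappeared" would at least be consistent with the unload theory.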

The OPTE-side work is laid out in oxidecomputer/opte#485 and is not at all complex. We can likely integrate it here once #5423 and its followups land; I believe destroy_virtual_hardware is the only place we'd want to clear the underlay to return to a clean-slate state.