linux-nvme / nvme-cli

NVMe management command line interface.

Home Page: https://nvmexpress.org

RFE: Perform actual discovery in discover_from_nbft()

tbzatek opened this issue · comments

Assume the NVMe/TCP boot attempts are configured to point to discovery controllers.

When called, e.g. by nvme connect-all --nbft, discover_from_nbft() uses only the SSNS records to establish connections. This is okay, since performing the actual discovery is primarily the responsibility of the pre-OS (UEFI) driver.

However, in case the defined discovery controller is inaccessible for some reason, e.g. in case of a broken multipath, the SSNS records won't get populated obviously. Still, the NBFT Discovery Descriptor List will likely contain the original Discovery Controller (DC) entries, and userspace may want to perform additional discovery and connection (e.g. when a path comes back up).

SSNS records carry a link to the Discovery Controller list and a 'discovered' flag, so there is actually evidence of pre-OS discovery attempts.

Mechanisms for calling nvme connect-all --nbft are being discussed in #2179.

Thanks for bringing this up. I thought that we already used the discovery records, but indeed we don't 😁
Let's involve @Douglas-Farley and @stuarthayes here.

TL;DR: Yes, I think we should do this.

The Boot spec is rather vague on the subject of discovery. What does the firmware actually write to the NBFT if it is configured to do discovery, after having found namespace(s) for booting?

  1. Write only discovery descriptor(s) to NBFT, and leave it to the OS to do the discovery,
  2. Write only the SSNS descriptor(s) used for booting to the NBFT, plus the discovery controllers that were used to obtain these records,
  3. Write all discovered SSNS descriptor(s) to the NBFT (note that this list might be long), plus the associated discovery controllers.

Figure 15 of the NVMe boot spec says about the Primary Discovery Controller Index in the SSNS record: "If a Discovery controller was used to establish this record this value shall be set to a non-zero value", and "shall" means mandatory, so the FW is not allowed to omit the discovery controller record(s) (although booting would probably still succeed if it did, and simply didn't set the "discovered" flag).

I see nothing in the spec that would forbid 1).

What should the OS do? IMO:

  4. If there are no SSNS records but discovery records, it should attempt discovery, attempt to connect to all discovered subsystems, and try to find the root file system on those.
  5. If there are SSNS records, during boot: it should attempt to connect to the SSNS specified in the NBFT, and (only) if that fails, fall back to discovery.
  6. If there are SSNS records, after booting: it should do discovery from all listed DCs (plus all else that might be configured in the OS itself).
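The three OS-side cases above can be read as a small decision table. The following is only an illustrative sketch: `Phase`, `Action`, and `plan_nbft_actions` are hypothetical names that do not exist in nvme-cli or libnvme, and the NBFT contents are reduced to two record counts and a failure flag.

```python
from enum import Enum, auto


class Phase(Enum):
    BOOT = auto()       # early boot, root file system not yet found
    POST_BOOT = auto()  # after booting


class Action(Enum):
    CONNECT_SSNS = auto()  # connect to the SSNS records listed in the NBFT
    DISCOVER = auto()      # query the discovery controllers listed in the NBFT


def plan_nbft_actions(n_ssns, n_discovery, phase, ssns_connect_failed=False):
    """Return the ordered actions for the given NBFT contents (hypothetical policy)."""
    if n_ssns == 0:
        # No SSNS records: discovery is the only way to locate the root FS.
        return [Action.DISCOVER] if n_discovery else []
    if phase is Phase.BOOT:
        actions = [Action.CONNECT_SSNS]
        # Fall back to discovery only if the SSNS connections failed.
        if ssns_connect_failed and n_discovery:
            actions.append(Action.DISCOVER)
        return actions
    # SSNS records present, after booting: discover from all listed DCs.
    return [Action.DISCOVER] if n_discovery else []
```

For example, `plan_nbft_actions(2, 1, Phase.BOOT)` yields only the SSNS connection step, while the same table after booting yields only discovery.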

I wonder if we need a command line option to tell nvme-cli whether or not it should try to use NBFT discovery records, to differentiate 5. and 6.

There's one corner case for 5): connection to the SSNS record(s) listed in the NBFT succeeds, but the root FS is not found on any of these namespaces. Maybe discovery would turn up additional subsystems/namespaces that contain the root FS. This would arguably be a misconfiguration. I wonder if we need to be able to deal with it.

Alternatively, we could always do 6) (discovery from NBFT discovery descriptors), but depending on the setup, that might delay and complicate booting if the discovery controllers have lots of target records to connect to.

In practice, if NVMeoF boot is used with discovery, I expect that customers won't configure the environment such that a boot environment would see hundreds of namespaces. The firmware would need a long long time to connect to all the subsystems and might actually fail if the list is too long (not sure what EFI can do in this respect).

This, in turn, would mean that either dedicated discovery controllers would be used for booting, or that we would use a different host NQN at boot time (and the discovery controller would hand over only a subset of the available records to the "boot" host NQN).

> However, in case the defined discovery controller is inaccessible for some reason, e.g. in case of a broken multipath, the SSNS records won't get populated obviously

I don't quite get this. If the FW wasn't able to obtain any SSNS records, how did it boot the OS, after all? Also, AFAICS the spec doesn't mandate that the FW write discovery records for non-functional discovery controllers into the NBFT.

IMO a more likely scenario is that the SSNS record from the NBFT is inaccessible but discovery still works (and possibly turns up a different SSNS record with an alternative path to the root FS). This would be covered by 5) above.

Thanks for your reply, there are some good points that I haven't really thought about.

I really like the current nvme connect-all --nbft behaviour: connect whatever has not been connected yet or failed in the past, gracefully skipping already established connections. For discovery, I thought it might work in a similar way.

> 3. Write all discovered SSNS descriptor(s) to the NBFT (note that this list might be long), plus the associated discovery controllers.

This is what I've been seeing with the Dell hardware; I haven't tried with EDK2. See the sample table I added in linux-nvme/libnvme#781.

> Figure 15 of the NVMe boot spec says about the Primary Discovery Controller Index in the SSNS record: "If a Discovery controller was used to establish this record this value shall be set to a non-zero value", and "shall" means mandatory, so the FW is not allowed to omit the discovery controller record(s) (although booting would probably still succeed if it did, and simply didn't set the "discovered" flag).
>
> I see nothing in the spec that would forbid 1).

Either way, we won't be able to do much with that in userspace.

> What should the OS do? IMO:
>
> 4. If there are no SSNS records but discovery records, it should attempt discovery, attempt to connect to all discovered subsystems, and try to find the root file system on those.

Totally agree.

> 5. If there are SSNS records, during boot: it should attempt to connect to the SSNS specified in the NBFT, and (only) if that fails, fall back to discovery.

during boot = dracut, with the primary focus on providing the rootfs?

I would suggest taking the SSNS Primary Discovery Controller Index into account. I.e. if there's a Discovery record but none of the SSNS records point to it, discovery should be performed. If there is at least one associated SSNS record, no re-discovery should be made during the dracut phase (or perhaps it should, in case all SSNS connections from this DC failed).

> 6. If there are SSNS records, after booting: it should do discovery from all listed DCs (plus all else that might be configured in the OS itself).

after booting = post-switchroot, with the primary focus on providing everything else needed for the OS?

We'll need to handle situations where some block devices referenced in fstab need to be present (systemd will wait for them, blocking startup). In such a case, some service needs to take care of this in the early boot phase. Think of placing /var or /srv on a different SSNS than the rootfs.

Sounds like we may need to split nvmf-connect-nbft.service into two: one that runs beforehand, in the early boot phase, and another one that is triggered by the network management service on link changes.

> I wonder if we need a command line option to tell nvme-cli whether or not it should try to use NBFT discovery records, to differentiate 5. and 6.

It would certainly come in handy for debugging, even if not actually used.

> There's one corner case for 5): connection to the SSNS record(s) listed in the NBFT succeeds, but the root FS is not found on any of these namespaces. Maybe discovery would turn up additional subsystems/namespaces that contain the root FS. This would arguably be a misconfiguration. I wonder if we need to be able to deal with it.

Hmm, this may very well be caused by a quirky pre-OS driver (wrt. case 1.)). Of course, a certain responsibility falls on the admin: good practice suggests placing the EFI system partition on the same namespace as the rootfs.

> Alternatively, we could always do 6) (discovery from NBFT discovery descriptors), but depending on the setup, that might delay and complicate booting if the discovery controllers have lots of target records to connect to.

I tend to agree with the split. Would be interesting to get more opinions... @johnmeneghini @igaw ?

> In practice, if NVMeoF boot is used with discovery, I expect that customers won't configure the environment such that a boot environment would see hundreds of namespaces. The firmware would need a long long time to connect to all the subsystems and might actually fail if the list is too long (not sure what EFI can do in this respect).

It already takes 2-3 minutes with our four-DC testbed. This might take much more time during the pre-OS connection phase and potentially a similar amount during the dracut phase. It might even hit some global timeout.

Shall we perhaps move NBFT connection attempts into threads within nvme-cli? :-)
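To picture the threading idea, here is a minimal, language-neutral sketch in Python (nvme-cli itself is written in C). It assumes connection attempts are independent of each other; `connect_all_parallel` and the `connect` callback are hypothetical placeholders, not nvme-cli or libnvme functions.

```python
from concurrent.futures import ThreadPoolExecutor


def connect_all_parallel(targets, connect, max_workers=4):
    """Run connect(target) for every target concurrently.

    Returns a dict mapping each target to True (connected) or False (failed).
    """
    targets = list(targets)

    def attempt(target):
        # A failing or unreachable target must not abort the other attempts.
        try:
            return bool(connect(target))
        except OSError:
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with targets.
        results = list(pool.map(attempt, targets))
    return dict(zip(targets, results))
```

With this shape, the total wait is bounded roughly by the slowest attempt in each batch of workers rather than by the sum of all timeouts.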

> However, in case the defined discovery controller is inaccessible for some reason, e.g. in case of a broken multipath, the SSNS records won't get populated obviously
>
> I don't quite get this. If the FW wasn't able to obtain any SSNS records, how did it boot the OS, after all? Also, AFAICS the spec doesn't mandate that the FW write discovery records for non-functional discovery controllers into the NBFT.

I was trying to describe case 4.) for e.g. a secondary path that is down and thus unable to perform discovery during the pre-OS phase, with the expectation that re-discovery would happen later, once the network is available.

> IMO a more likely scenario is that the SSNS record from the NBFT is inaccessible but discovery still works (and possibly turns up a different SSNS record with an alternative path to the root FS). This would be covered by 5) above.

Yes, precisely.

Our original intent was that any namespace that is discovered (directly or via a discovery subsystem) will have an SSNS entry. A compliant pre-OS driver will also have at least minimally populated an SSNS record for each NID in question.

Those namespaces should all have come from something like an attempt structure in the edk2 reference, and those attempts are administratively populated, pointing towards either an IO or a Discovery subsystem. If it was a discovery subsystem, then a discovery record should have been created and the SSNS in question should point to that discovery record. Perhaps a pre-OS driver, with mDNS, might populate the discovery table without there being an SSNS record (i.e. an unprovisioned host), but if there were any discovered namespaces, they should both have an SSNS and point back to the first discovery controller in the chain. Starting from the discovery controller referenced by an SSNS record is better because it allows for administrative orchestration and re-location of the underlying resource, and for resilience through better routes and topology steering.

My point being: from the POV of an OS application trying to re-establish connectivity, I would start from the discovery subsystem indicated in the SSNS record and search for the NID. If you cannot find the NID from that direction, then falling back to connecting directly to the IO subsystem indicated in the SSNS is reasonable, but if there was a Discovery Subsystem then obviously it had to have been administratively specified.

Recapping:

  1. Read SSNS, check for a Discovery Record
    a) If a Discovery Record is found, start from there
    b) If a Discovery Record is not found, or the controller is inaccessible, or the NID is not reachable/findable, go to (2)
  2. Fall back to the IO controller as specified in the SSNS
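The recap above could be sketched roughly as follows. This is an illustration under stated assumptions, not libnvme code: `SsnsRecord`, its field names, and the `discover`/`connect` callbacks are all hypothetical stand-ins for whatever the real implementation would use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SsnsRecord:
    nid: str                              # namespace identifier to locate
    subsys_addr: str                      # IO subsystem address stored in the SSNS
    discovery_addr: Optional[str] = None  # discovery controller it points to, if any


def reconnect(rec, discover, connect):
    """Try the discovery path first, then fall back to the IO controller.

    discover(addr, nid) returns a subsystem address or None (hypothetical callback);
    connect(addr) returns True on success (hypothetical callback).
    Returns "discovery", "direct", or None if both paths fail.
    """
    if rec.discovery_addr is not None:
        try:
            found = discover(rec.discovery_addr, rec.nid)
        except OSError:
            found = None  # discovery controller inaccessible
        if found is not None and connect(found):
            return "discovery"
    # No discovery record, controller inaccessible, or NID not reachable.
    if connect(rec.subsys_addr):
        return "direct"
    return None
```

Starting from the discovery record means a namespace that was relocated by the administrator is still found, while the direct path only covers the address recorded at boot time.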

What is the status here? I lost the overview...

> What is the status here? I lost the overview...

Working on it, will publish something hopefully next week or the week after. Now that the firmware has gotten related fixes through timberland-sig/edk2#35 I'm going to use it as a testing baseline.

This ticket has grown large and could use breaking into smaller parts for the boot phases discussion. Lots of loose ideas though.

Okay, posted #2315 as my first attempt at implementing actual discovery from the NBFT. The test setup is described in linux-nvme/libnvme#821 and was generated by timberland-sig/edk2#35.

This is roughly how it's currently set to work:

  1. Read SSNS, skip discovery NQN, otherwise make a connection attempt
  2. Read Discovery Descriptor List, check if there are any SSNS records referencing it (indicating successful discovery during the pre-OS phase). Skip if so, otherwise perform discovery and do connect-all
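The two steps above can be sketched as a planning function. This is not the code from the pull request: the `Ssns` and `DiscoveryRecord` types and their field names are hypothetical. Two things are taken as given: the well-known discovery NQN, and the rule quoted earlier in this thread that a non-zero Primary Discovery Controller Index marks a record established through a discovery controller.

```python
from dataclasses import dataclass

# Well-known NQN of the NVMe-oF discovery service.
DISCOVERY_NQN = "nqn.2014-08.org.nvmexpress.discovery"


@dataclass
class Ssns:
    subsys_nqn: str
    discovery_index: int = 0  # 0 = no discovery controller was used (hypothetical field)


@dataclass
class DiscoveryRecord:
    index: int
    uri: str  # hypothetical field holding the discovery controller address


def plan(ssns_list, discovery_list):
    """Return (subsystem NQNs to connect, discovery URIs to query)."""
    # Step 1: connect every SSNS record except those naming the discovery NQN.
    connects = [s.subsys_nqn for s in ssns_list if s.subsys_nqn != DISCOVERY_NQN]
    # Step 2: a discovery record referenced by any SSNS record was already
    # used successfully in the pre-OS phase, so it is skipped.
    referenced = {s.discovery_index for s in ssns_list if s.discovery_index != 0}
    discovers = [d.uri for d in discovery_list if d.index not in referenced]
    return connects, discovers
```

Only discovery controllers that produced no SSNS record end up in the second list, which keeps the work done at boot small when pre-OS discovery succeeded.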

Good to close?

> Good to close?

Well, there are some thoughts in the above conversation that might be worth implementing. We're currently working with @johnmeneghini to identify further requirements and perhaps the way NBFT discovery should be done.