ofiwg / libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/


MPI related: Cannot get multi-provider to work. Should it?

PHHargrove opened this issue

I am trying (and so far failing) to get a "hybrid" HPC application running on an Omni-Path system. By hybrid, I mean it uses both MPI and a second programming model. In this case the second one is using GASNet-EX (which I help maintain). The first happens to be a current release of Intel MPI, in which libfabric is the only inter-node communications API (i.e. no environment setting to tell MPI to use "tcp" other than via the "sockets" or "tcp" providers). Since I am aware that Omni-Path has limitations regarding how many times one can open the NIC, I am trying to avoid the issue by having MPI and GASNet-EX use different providers (such as "sockets" and "psm2").
So my first question:

Is it reasonable to expect libfabric (1.14.0 or newer, fwiw) to support concurrent use of multiple providers?

If the answer to my first question is "No", then you can stop reading here.

The first call to fi_getinfo() is made by Intel MPI, which appears to have set FI_PROVIDER in the environment (the user did not set it, but when GASNet-EX initialization runs later it is already set). Though Intel MPI set FI_PROVIDER=psm2, I have GASNet-EX change that to FI_PROVIDER=sockets before making its call to fi_getinfo(). However, fi_getinfo() returns only entries for the psm2 provider (despite FI_PROVIDER=sockets in the then-current environment). So, this leads to my second question:

If multi-provider runs are intended to work, how can caching of fi_getinfo() results be suppressed (or the cache invalidated) to allow a second (or subsequent) provider to be discovered?

I should note that "opx" provider crashes for me even in a simple (non-hybrid) "Hello, World!" application. So, this issue is focused on the "psm2" provider.

CC: @bonachea

Did you try something like:

hints->fabric_attr->prov_name = strdup("tcp;ofi_rxm");
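
As a rough illustration (not code from this thread), restricting discovery to one provider via hints rather than FI_PROVIDER might look like the following sketch. The function name is made up for the example, the version constant is arbitrary, and most error handling is omitted:

#include <rdma/fabric.h>
#include <stdio.h>
#include <string.h>

/* Sketch: ask libfabric for info entries from a single provider by name,
 * using hints instead of the FI_PROVIDER environment variable. */
static struct fi_info *get_provider_info(const char *prov)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;
    int ret;

    if (!hints)
        return NULL;

    /* e.g. "tcp;ofi_rxm", "sockets", or "psm2" */
    hints->fabric_attr->prov_name = strdup(prov);

    ret = fi_getinfo(FI_VERSION(1, 14), NULL, NULL, 0, hints, &info);
    fi_freeinfo(hints);
    if (ret) {
        fprintf(stderr, "fi_getinfo(%s) failed: %d\n", prov, ret);
        return NULL;
    }
    return info;   /* caller releases with fi_freeinfo(info) */
}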

> Did you try something like:
>
> hints->fabric_attr->prov_name = strdup("tcp;ofi_rxm");

I tried that just now, and it didn't entirely work.

In response to the suggestion, I've set things up so that environment variable GASNET_OFI_PROVIDER (if set) sets hints->fabric_attr->prov_name. Similarly, Intel MPI has I_MPI_OFI_PROVIDER.

I find that setting GASNET_OFI_PROVIDER=sockets (and not setting I_MPI_OFI_PROVIDER) works.
That is a small victory, since I can make GASNet-EX slow and MPI fast.
However, I need the other way around as well.

Surprisingly, GASNET_OFI_PROVIDER=sockets I_MPI_OFI_PROVIDER=psm2 did not work, yielding -FI_ENODATA returns from fi_getinfo() in GASNet-EX's initialization. This should be identical to the working case.

And, alas, the make-gasnet-fast case of GASNET_OFI_PROVIDER=psm2 I_MPI_OFI_PROVIDER=sockets also yields -FI_ENODATA from fi_getinfo().

These two cases in which I set I_MPI_OFI_PROVIDER are where I fear some "caching" of results in libfabric might be involved. Adding unsetenv("FI_PROVIDER"); in addition to setting prov_name did not help, fwiw.

It seems that even in the absence of Intel MPI, FI_PROVIDER=verbs, verbs;ofi_rxm, tcp and tcp;ofi_rxm all fail (though fi_info -l is clear about their presence). So there might be something else wrong on this system, and I am limiting myself to the sockets provider for the moment.

I'll poke at this some more next week to see if I can understand why setting both runtimes' provider variables yields -FI_ENODATA.
Any clues would be appreciated.

libfabric supports the use of multiple providers at once. The issue is that the apps are using environment variables to control the behavior, rather than using the API. Setting FI_PROVIDER will add a filter to the provider list that will prevent all other providers from being reported. This filter is applied at library initialization.

I don't know how MPI or GASNet map their environment variables to libfabric.

@shefty I am not setting FI_PROVIDER in GASNet-EX.
Apparently, if I_MPI_OFI_PROVIDER is set, then Intel MPI is setting FI_PROVIDER within MPI_Init().

If I unsetenv("FI_PROVIDER"), then the behavior I see seems as if libfabric is still applying the filter imposed by Intel MPI's prior setting. Since I don't have any control over Intel MPI, is there some way to ensure libfabric discards the out-of-date filter?

The filter is applied once by libfabric at library initialization time. It can't be unset later.

MPI might not override it if it's already set, though. You could try setting it yourself before calling MPI. You can specify a list of providers that you want active or a list that you want to ignore (e.g. FI_PROVIDER=^none).

The hybrid application is launched via mpirun or mpiexec, and runs MPI_Init() before GEX_Client_Init(). So I have no programmatic way to control the environment prior to Intel MPI running.

I tried each of the following manually, all of which result in MPI reporting OFI addrinfo() failed (ofi_init.c:1569:MPIDI_OFI_mpi_init_hook:No data available)

  • env FI_PROVIDER=sockets,psm2 I_MPI_OFI_PROVIDER=sockets mpirun ...
  • env FI_PROVIDER=tcp,psm2 I_MPI_OFI_PROVIDER=tcp mpirun ...
  • env FI_PROVIDER=udp,psm2 I_MPI_OFI_PROVIDER=udp mpirun ...
  • env FI_PROVIDER=verbs,psm2 I_MPI_OFI_PROVIDER=verbs mpirun ...

Replacing the comma in FI_PROVIDER with a space or semi-colon (and adding quoting in both cases) did not help.
The following also fails, and suggests to me Intel MPI is doing something different than I am imagining:

  • env FI_PROVIDER=sockets,psm2 I_MPI_OFI_PROVIDER=psm2 mpirun ...

@shefty, if I report this to the Intel MPI team, would I be accurate in reporting that you've stated that their apparent use of the FI_PROVIDER environment variable is not "best practice" for filtering providers?

I would say that the use of environment variables anywhere is never a best practice. :) But, yes, if you report this, I'd request that MPI not set FI_PROVIDER internally, if that is what they are doing.

Have you tried using psm2 for both MPI and Gasnet? @j-xiong - Do you remember if psm2 can support opening multiple endpoints, at least from the perspective of an app above OFI?

@shefty The inability to run both over psm2 was the cause of the work which led to this issue.
When I do not set any environment variables, GASNet-EX initializes second and I get
*** FATAL ERROR (proc 0): in gasnetc_ofi_init() at [redacted]/upc-runtime/gasnet/ofi-conduit/gasnet_ofi.c:989: fi_endpoint for rdma failed: -22(Invalid argument)

In the absence of Intel MPI, GASNet-EX over psm2 works "OK" but can only use as many procs as 1/3 the number of CPU cores (or maybe threads?) because we open three endpoints which each appear to consume one psm2 context.

Yes I think psm2 is implemented to allow multiple domains (one for each upper layer library) but I personally didn't test that. Even with that part working, the context limitation is going to be more exaggerated since both MPI and GASNet will each open its own set of endpoints and consume more psm2 contexts.

Would a mix of command line parameters and programmatically setting the provider work for you?

> mpirun -np 1024 ... ./cool-gasnet-application -p sockets

> Would a mix of command line parameters and programmatically setting the provider work for you?
>
> mpirun -np 1024 ... ./cool-gasnet-application -p sockets

I don't see how that would help. All it does is move the passing of the value "sockets" from a new environment variable I created a few days ago to the command line. In either case, the value ends up as a hints->fabric_attr->prov_name setting as you suggested. Why would the behavior (of Intel MPI in particular) change? Am I missing something?

@zhenggb72 - Please look over this issue. Any ideas of what can be done from the MPI side to select a different provider without impacting gasnet?

Intel MPI does set FI_PROVIDER when I_MPI_OFI_PROVIDER is set.
However, I don't know why it still didn't work even after you unsetenv(FI_PROVIDER).

Once libfabric initializes, the filtering for FI_PROVIDER is set.

I'll see if I can come up with some let's-pretend-it's-not-a-hack-but-it-really-is-a-hack solution to make this work.

FI_IGNORE_FI_PROVIDER for the flags of fi_getinfo?

@tschuett We actually have such a thing. It's called OFI_GETINFO_HIDDEN, and it's an internal flag, bit 60. This isn't something we want to support, but it's used by providers to find other providers that may be filtered. For example, EFA uses this to get the shared memory provider, even if it's excluded from FI_PROVIDER.

Intel MPI already uses the I_MPI_OFI_PROVIDER setting explicitly to filter out the providers like fabtests' -p parameter does. The internal setting of FI_PROVIDER is most likely an unnecessary step that can be removed in future Intel MPI releases.

But OFI_GETINFO_HIDDEN could be a temporary solution for the original author.

> But OFI_GETINFO_HIDDEN could be a temporary solution for the original author.

Assuming (1ULL << 60) is the right value, I've just now tried passing it to fi_getinfo() as the flags argument. Combined with I_MPI_OFI_PROVIDER=tcp and hints->fabric_attr->prov_name = strdup("psm2");, this does get me "psm2" entries returned from fi_getinfo() at GASNet-EX initialization. HOWEVER, the very next step is a fi_endpoint() call which is now failing with -22(Invalid argument). So, I suspect something was still "filtered" due to the FI_PROVIDER=tcp at library initialization.
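
For reference, a sketch of that experiment (OFI_GETINFO_HIDDEN is an internal flag copied from include/ofi.h, not a supported API, and its value could change between releases; the helper name here is made up):

#include <rdma/fabric.h>
#include <string.h>

/* Internal libfabric flag (bit 60); unsupported hack to see providers that
 * the FI_PROVIDER filter applied at library init would otherwise hide. */
#define OFI_GETINFO_HIDDEN (1ULL << 60)

static struct fi_info *get_hidden_provider_info(const char *prov)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    hints->fabric_attr->prov_name = strdup(prov);   /* e.g. "psm2" */
    if (fi_getinfo(FI_VERSION(1, 14), NULL, NULL,
                   OFI_GETINFO_HIDDEN, hints, &info))
        info = NULL;
    fi_freeinfo(hints);
    return info;
}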

That's the correct value. See include/ofi.h. I mentioned it specifically as a hackable option. :)

The -EINVAL error is likely unrelated to any filtering. gasnet had to open the psm2 fabric and domain objects, so I would assume there's some other issue. Log messages here might show what field is invalid.

Without MPI in the picture, GASNet-EX has no problem, but with the hack above it looks like something is different when MPI is the first to initialize libfabric. The -EINVAL corresponds to the following from FI_LOG_LEVEL=debug:

libfabric:28316:psm2:core:psmx2_trx_ctxt_alloc():275 number of Tx/Rx contexts exceeds limit (1).

Intel MPI may have set PSM2_MULTI_EP=0 and thus forces a single psm2 context per rank. Please try setting I_MPI_THREAD_EP_MAX to a value greater than 1.

You can also set PSM2_MULTI_EP=1 in advance so it will not be overwritten by Intel MPI.

Isn't only gasnet opening the psm2 endpoint in the above? MPI should be asking for tcp.

Intel MPI sets those environment variables regardless of what provider is to be used.

@j-xiong You were correct about environment settings by Intel MPI.

So, I can now finally get the combination of MPI over tcp and GASNet-EX over psm2 (the other way around was simple).
  • I must use flags = OFI_GETINFO_HIDDEN in my calls to fi_getinfo().
  • I may use hints->fabric_attr->prov_name = ... to control the providers offered to me (but can skip it if I want the default highest-priority provider).
  • If I wish to use psm2 in GASNet-EX then I must set PSM2_MULTI_EP=1 in the environment of mpirun or mpiexec (or via their command line) to prevent Intel MPI from setting it to 0 within MPI_Init().

Thanks for everyone's help.
Consider this issue to now be a feature request for a "supported" version of the OFI_GETINFO_HIDDEN flag.

The environment variable settings are not overwriting -- if the variable is already set it won't be changed. So instead of using OFI_GETINFO_HIDDEN, one can set FI_PROVIDER=tcp,psm2 and I_MPI_OFI_PROVIDER=tcp so that Intel MPI can pick up tcp and psm2 is still visible to other libraries such as GASNet.

> The environment variable settings are not overwriting -- if the variable is already set it won't be changed. So instead of using OFI_GETINFO_HIDDEN, one can set FI_PROVIDER=tcp,psm2 and I_MPI_OFI_PROVIDER=tcp so that Intel MPI can pick up tcp and psm2 is still visible to other libraries such as GASNet.

As I reported earlier, that leads Intel MPI to fail. The following is with a simple MPI hello-world application (so no GASNet-EX):

$ env FI_PROVIDER=tcp,psm2 I_MPI_OFI_PROVIDER=tcp mpirun  --hostfile [...] -np 2 --ppn 1 ./hello_mpi
Abort(1091215) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1183)..............:
MPIDI_OFI_mpi_init_hook(1569): OFI addrinfo() failed (ofi_init.c:1569:MPIDI_OFI_mpi_init_hook:No data available)

The last two lines of libfabric tracing prior to that message seem to have disqualified the tcp provider:

libfabric:9611:core:core:fi_getinfo_():1161<info> Since psm2 can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:9611:core:core:fi_getinfo_():1123<warn> Can't find provider with the highest priority

Since one of you is likely to ask:

[0] MPI startup(): Intel(R) MPI Library, Version 2021.4  Build 20210831 (id: 758087adf)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.0-impi

Ok, that doesn't work. When both variables are set, Intel MPI would overwrite I_MPI_OFI_PROVIDER setting with the value from FI_PROVIDER. However, I_MPI_OFI_PROVIDER is used as a single provider name and something like tcp,psm2 doesn't fit here.

Before changes can be made to Intel MPI, probably the OFI_GETINFO_HIDDEN option is the closest path for a workaround.

What version of libfabric is this? :)

Yeah, MPI uses an internal version of libfabric, which plays games in the fi_getinfo path. This output is not from the upstream libfabric:

libfabric:9611:core:core:fi_getinfo_():1161<info> Since psm2 can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:9611:core:core:fi_getinfo_():1123<warn> Can't find provider with the highest priority

You may get different behavior setting the variables @j-xiong suggests if you force MPI to use an external version, which is probably what you want anyway when using gasnet. Otherwise gasnet is going to pick up MPI-specific modifications to libfabric.

Forcing Intel MPI to use the system install of libfabric 1.14.0 did not change the failure but the tracing output (attached below) is different.

$ env  I_MPI_DEBUG=2 FI_PROVIDER=tcp,psm2 I_MPI_OFI_PROVIDER=tcp mpirun  --hostfile [...] -np 2 --ppn 1 ./hello_mpi
[0] MPI startup(): Intel(R) MPI Library, Version 2021.4  Build 20210831 (id: 758087adf)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation.  All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.14.0
Abort(1091215) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1183)..............:
MPIDI_OFI_mpi_init_hook(1569): OFI addrinfo() failed (ofi_init.c:1569:MPIDI_OFI_mpi_init_hook:No data available)

Complete output with the addition of FI_LOG_LEVEL=debug is here

> Forcing Intel MPI to use the system install of libfabric 1.14.0 did not change [....]

Ah, right. I had started composing that comment before seeing @j-xiong mention that "Intel MPI would overwrite I_MPI_OFI_PROVIDER setting with the [comma-separated list] value from FI_PROVIDER". So, of course the failure is not a function of the libfabric version.

Maybe the result of this discussion can be a multi-provider fabtest.

I doubt a fabtest helps. The problem is Intel MPI is paired with a provider with limited design scope, both of which assume MPI is the only library that apps care about. The fix should start with MPI, so that @j-xiong's suggestion of how to set the environment variables can work.

@PHHargrove

It occurs to me that you just need MPI and GASNet to use different interconnects (MPI to use tcp, and GASNet to use libfabric). The requirement for multiple providers came from the fact that Intel MPI uses libfabric too.

If that is true, I think you can try to use Open MPI, which has its own tcp transport.

In fact, if you compile Open MPI without libfabric support, it will not use libfabric at all, but can still use its btl/tcp component to communicate.

@wzamazon wrote

> It occurs to me that you just need MPI and GASNet to use different interconnects (MPI to use tcp, and GASNet to use libfabric). The requirement for multiple providers came from the fact that Intel MPI uses libfabric too.
> [...]

The requirement to use Intel MPI came from the user I am trying to support.
We are both aware of the option to use another MPI, but the user had declined that option before I opened this issue.

> [...]
> In fact, if you compile Open MPI without libfabric support, it will not use libfabric at all, but can still use its btl/tcp component to communicate.

IMO: With a hybrid application the user should have the option to try letting either MPI or GASNet use psm2, to see which way gives the best overall performance (application-dependent, and possibly data-dependent). So compiling an MPI without libfabric support wouldn't really be in the user's best interest.

Any progress regarding possibly more friendly behavior by Intel MPI? @j-xiong suggested such might occur in a future Intel MPI here, and @shefty suggested that was preferred to any libfabric changes.

If "no" regarding progress on the Intel MPI side, any progress toward a public name/variant of OFI_GETINFO_HIDDEN?

No progress. The use of environment variables is part of the problem; they should never be set by a library. libfabric even provides a programmatic way to control this! This needs to be reported as a bug against Intel MPI. Let me see where to report MPI bugs.

I don't think it makes sense to expose a programmatic way to override an environment variable.

@PHHargrove - I finally checked with the MPI team. Do you have an Intel TCE or customer support contact for Intel MPI? That's the preferred way of submitting a bug. If not, I can see about submitting an internal ticket.

> @PHHargrove - I finally checked with the MPI team. Do you have an Intel TCE or customer support contact for Intel MPI? That's the preferred way of submitting a bug. If not, I can see about submitting an internal ticket.

I do not. This issue is motivated by our efforts to support a user of a hybrid application which uses both MPI and GASNet-EX (over libfabric over psm2). It is possible they have Intel support, but I don't think it is practical to ask this application user to file a ticket for this.

@PHHargrove - can you send me a direct email? (sean.hefty @ intel.com) I don't see that I have a direct contact for you.

a possible workaround is to set FI_PROVIDER=sockets, and also set I_MPI_OFI_PROVIDER=psm2, before calling Intel MPI.

I believe Intel MPI won't overwrite FI_PROVIDER when it is already set.

> a possible workaround is to set FI_PROVIDER=sockets, and also set I_MPI_OFI_PROVIDER=psm2, before calling Intel MPI.
>
> I believe Intel MPI won't overwrite FI_PROVIDER when it is already set.

From this comment:

> When both variables are set, Intel MPI would overwrite I_MPI_OFI_PROVIDER setting with the value from FI_PROVIDER.

@zhenggb72 - that won't work. If FI_PROVIDER is set, that filter is applied by libfabric and no providers not on that list will be returned. Basically, environment variables, especially at the libfabric level, aren't usable. GASNet has an undocumented work-around for now, until the MPI team can address this.

I see. So it looks like in this multi-provider case, no library should set FI_PROVIDER. But each module should have its own separate environment variable to choose a provider.
As to Intel MPI, I can propose a change so that using I_MPI_OFI_PROVIDER does not overwrite FI_PROVIDER internally. It actually only sets FI_PROVIDER when it is NOT already set. That was why I thought setting FI_PROVIDER before Intel MPI would be a workaround. But I understand from Sean that even that won't work.

As of today, the latest Intel MPI (e.g. 2021.11) no longer overwrites FI_PROVIDER with the I_MPI_OFI_PROVIDER setting. That means settings like FI_PROVIDER=tcp,psm2 and I_MPI_OFI_PROVIDER=psm2 would allow Intel MPI to use psm2 while the other model uses tcp.

Closing it as fixed.

@j-xiong

I have not yet been able to produce the desired result of two communications runtimes using different providers when Intel MPI initializes first, but will continue to try.

However, as can be seen below, following your advice with the positions of psm2 and tcp reversed, to have Intel MPI use the tcp provider while reserving psm2 for the "other" runtime, results in warnings from every rank about the two variables having different values. So, even if the problem reported in this issue has been resolved (not yet confirmed), the end-user experience is "poor".

$ env FI_PROVIDER=tcp,psm2 I_MPI_OFI_PROVIDER=tcp mpirun -np 4 ./hello_mpi
MPI startup(): I_MPI_OFI_PROVIDER(tcp) and FI_PROVIDER(tcp,psm2) set to different values, please unset one of the two cvars or set them to the same value
MPI startup(): I_MPI_OFI_PROVIDER(tcp) and FI_PROVIDER(tcp,psm2) set to different values, please unset one of the two cvars or set them to the same value
MPI startup(): I_MPI_OFI_PROVIDER(tcp) and FI_PROVIDER(tcp,psm2) set to different values, please unset one of the two cvars or set them to the same value
MPI startup(): I_MPI_OFI_PROVIDER(tcp) and FI_PROVIDER(tcp,psm2) set to different values, please unset one of the two cvars or set them to the same value
Hello, world from 1 of 4!
Hello, world from 3 of 4!
Hello, world from 2 of 4!
Hello, world from 0 of 4!

Edit:
Here is version info I meant to include:

$ which mpicc
[redacted]/oneapi/mpi/2021.11/bin/mpicc

$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2021.11 Build 20231005 (id: 74c4a23)
Copyright 2003-2023, Intel Corporation.

@j-xiong please REOPEN this issue.
While I have confirmed that Intel MPI 2021.11.0 does not overwrite FI_PROVIDER, I find that it still allocates psm2's single context, suggesting that it has initialized the psm2 provider. So, the original issue has not been resolved.

With the new Intel MPI, I still cannot run env FI_PROVIDER=tcp,psm2 I_MPI_OFI_PROVIDER=tcp mpirun ... for a hybrid job in which MPI initializes first. With the environment variable FI_PROVIDER now preserved, one just reaches the next problem.

When not using Intel MPI, GASNet-EX can use the psm2 provider just fine. However, with Intel MPI initializing first, GASNet-EX fails a fi_endpoint() call with an FI_EINVAL return, which tracing shows is due to:

libfabric:8427:1711604877::psm2:core:psmx2_trx_ctxt_alloc():273<warn> number of Tx/Rx contexts exceeds limit (1).

Use of I_MPI_DEBUG=1 confirms that Intel MPI has selected the tcp;ofi_rxm provider, but somehow it still seems to have "consumed" the (default) one PSM2 context per process.

Currently, I can work around this by setting PSM2_MULTI_EP=1. But that should not be required, and my understanding is that it comes with a potential performance cost.

Additionally, even though it is not using the psm2 provider, Intel MPI has set multiple FI_PSM2_* environment variables. Particularly concerning is FI_PSM2_LOCK_LEVEL=0 (default is 2) which disables provider locking and is incorrect for GASNet-EX's thread-safe builds. Fortunately, I find values I set prior to mpirun are preserved.

TL;DR: even when not using the psm2 provider, it appears that Intel MPI continues to initialize and configure it in a manner which prevents another programming model from using it without setting multiple environment variables (one of which is believed to have a performance cost).

Isn’t this an impi problem, not solvable by libfabric?

> Isn't this an impi problem, not solvable by libfabric?

Not sure here.
If the PSM2 context is being allocated explicitly by Intel MPI (such as by a call to fi_fabric() without a balancing fi_close()?), then the solution would be in their code. However, if libfabric has somehow "leaked" the PSM2 context (such as not releasing it upon fi_close()) then the solution would lie in libfabric.
I am not in a position to know which is the case, but am happy to provide trace output if somebody provides the exact FI_LOG_* variables they want set.
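
For clarity, a sketch of the fi_fabric()/fi_close() pairing being discussed (illustrative only, not code from either library): every fabric and domain object opened against psm2 should eventually be released with fi_close(), otherwise provider resources such as a psm2 context can stay allocated for the life of the process.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Open a fabric and domain from an fi_info obtained via fi_getinfo(),
 * then close them in reverse order. An unbalanced open would keep the
 * provider-side context allocated. */
static int open_and_close(struct fi_info *info)
{
    struct fid_fabric *fabric = NULL;
    struct fid_domain *domain = NULL;
    int ret;

    ret = fi_fabric(info->fabric_attr, &fabric, NULL);
    if (ret)
        return ret;

    ret = fi_domain(fabric, info, &domain, NULL);
    if (ret) {
        fi_close(&fabric->fid);
        return ret;
    }

    /* ... endpoints, CQs, etc. would be created and used here ... */

    fi_close(&domain->fid);   /* release the domain (and its psm2 context) */
    fi_close(&fabric->fid);   /* then the fabric */
    return 0;
}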