ofiwg / libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/

prov/opx: provider is reported even when OPA hw is not present

j-xiong opened this issue

Describe the bug
On machines without OPA (Omni-Path Architecture) hardware, fi_getinfo() still returns opx as a valid provider.

To Reproduce
Run fi_info -p opx on a system without OPA hardware. It reports:

provider: opx
    fabric: OPX-100
    domain: hfi1
    version: 115.0
    type: FI_EP_UNSPEC
    protocol: FI_PROTO_OPX

In addition, the opx provider is listed before the verbs provider, so applications that use the first returned provider may not work properly.
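To illustrate the ordering problem, here is a minimal sketch in Python that parses fi_info-style text and picks a provider. The opx block is the output reported above; the verbs block uses made-up placeholder values, and the helper names (parse_fi_info, pick_provider) are hypothetical, not part of libfabric.

```python
def parse_fi_info(text):
    """Parse fi_info-style output into a list of dicts, one per provider block."""
    providers = []
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "provider":
            # A "provider:" line starts a new block.
            providers.append({"provider": value})
        elif providers:
            # Indented attribute lines belong to the most recent block.
            providers[-1][key] = value
    return providers


def pick_provider(providers, exclude=()):
    """Return the first provider entry whose name is not excluded, or None."""
    for entry in providers:
        if entry["provider"] not in exclude:
            return entry
    return None


# The opx block is copied from this issue; the verbs block is a
# placeholder with invented values, for illustration only.
SAMPLE = """\
provider: opx
    fabric: OPX-100
    domain: hfi1
    version: 115.0
    type: FI_EP_UNSPEC
    protocol: FI_PROTO_OPX
provider: verbs
    fabric: example-fabric
    domain: example-domain
"""

providers = parse_fi_info(SAMPLE)
naive = pick_provider(providers)                      # takes whatever is listed first
filtered = pick_provider(providers, exclude={"opx"})  # skips the bogus entry
```

An application that takes the first entry ends up with opx even though no OPA hardware exists; excluding it by name is only a workaround until the provider stops reporting itself.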

Expected behavior
fi_getinfo() returns -61 (-FI_ENODATA), since no usable opx provider exists on the machine.
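For readers unfamiliar with the number: libfabric calls return negative error codes, and -61 corresponds to ENODATA on Linux. A small sketch, assuming a Linux host where errno values match the kernel's, that turns such a return value into a readable name (describe_fi_return is a hypothetical helper, not a libfabric function):

```python
import errno
import os


def describe_fi_return(ret):
    """Map a libfabric-style return value (0 or a negative errno) to text."""
    if ret >= 0:
        return "success"
    # errno.errorcode maps the positive errno number to its symbolic name.
    name = errno.errorcode.get(-ret, "unknown")
    return f"{name}: {os.strerror(-ret)}"


# Errno numbers are platform-specific, so look the value up by name
# rather than hard-coding 61.
message = describe_fi_return(-errno.ENODATA)
```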

Output
n/a

Environment:
Linux

Additional context
Comment from DAOS issue daos-stack/libfabric#56:

@frostedcmos, Lei discovered that --disable-opx is required on a cluster with Mellanox / IB for OFI to properly report verbs as a provider that we can use. This is why I was having issues using 1.15rc3 on the io500 cluster: I had just updated the OFI build without changing the configure options. I don't know if this is really the intention, but why would one need to explicitly disable opx on an IB cluster, and why would it not report that verbs is an available provider? We don't even have OPA on those nodes. Sounds like a bug to me, but I don't know whether we should hold this patch for it.

Yeah, this isn't right. Will fix.

@timothom64 - Is it possible to provide a fix for this today? v1.15 has already been delayed, and I'd like to publish the release, but I think we'll want a fix for this.

I will try

In CI testing now

Created PR.