ofiwg / libfabric

Open Fabric Interfaces

Home Page:http://libfabric.org/

prov/opx: used by default instead of psm2 even though it's "beta"

bartoldeman opened this issue

Describe the bug
The opx provider is selected by default even though it is labelled BETA; psm2 is only used if you disable opx or set FI_PROVIDER=psm2.

If opx is enabled, it takes priority over psm2 even though it is labelled BETA. From src/fabric.c:

        char *ordered_prov_names[] = {
                "efa", "opx", "psm2", "psm", "usnic", "gni", "bgq", "verbs",

Shouldn't this be "efa", "psm2", "opx", "psm", "usnic", "gni", "bgq", "verbs", instead?
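To make the effect of the ordering concrete, here is a minimal sketch of how the position of a name in an ordered list can decide which provider wins. It is an illustration only, not the selection logic in src/fabric.c: the helpers `prov_rank` and `pick_default` are hypothetical, and the array is a shortened copy that holds only the eight names quoted above.

```c
#include <stddef.h>
#include <string.h>

/* Shortened copy of the names quoted above (assumption: the real list in
 * src/fabric.c continues past "verbs"). */
static const char *const ordered_names[] = {
    "efa", "opx", "psm2", "psm", "usnic", "gni", "bgq", "verbs",
};
#define NUM_ORDERED (sizeof(ordered_names) / sizeof(ordered_names[0]))

/* Hypothetical helper: position of `name` in `order`; a lower value means a
 * higher priority. Names that are not listed rank after every listed name. */
static size_t prov_rank(const char *const *order, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(order[i], name) == 0)
            return i;
    return n;
}

/* Hypothetical helper: return the available provider with the best rank,
 * or NULL when no provider is available. */
static const char *pick_default(const char *const *order, size_t n_order,
                                const char *const *avail, size_t n_avail)
{
    const char *best = NULL;
    size_t best_rank = n_order + 1;

    for (size_t i = 0; i < n_avail; i++) {
        size_t r = prov_rank(order, n_order, avail[i]);
        if (r < best_rank) {
            best_rank = r;
            best = avail[i];
        }
    }
    return best;
}
```

With "verbs", "psm2" and "opx" all available, `pick_default` returns "opx" for the quoted order and "psm2" for the proposed order, which is the behaviour change this issue asks for.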

Secondly, if you force psm2 via FI_PROVIDER=psm2, all symbols that should be dynamically linked from libpsm2.so (e.g. psm2_mq_irecv2) are duplicated by the psm3 provider inside libfabric.so, so they are not taken from libpsm2.so. As a consequence, all communication goes over Ethernet instead of Omni-Path. As far as I can see, this was fixed in libfabric 1.15.0 (commit 3f1d52d).

To Reproduce
Steps to reproduce the behavior:

  • Without FI_PROVIDER set, run a test with FI_LOG_LEVEL=trace; the log shows that opx is used on Omni-Path.

Expected behavior

  • Without FI_PROVIDER set, run a test with FI_LOG_LEVEL=trace; the log should show that psm2 is used on Omni-Path.

Environment:

$ opainfo 
hfi1_0:1                           PortGID:0xfe80000000000000:00117501017afb0e
   PortState:     Active
   LinkSpeed      Act: 25Gb         En: 25Gb        
   LinkWidth      Act: 4            En: 4           
   LinkWidthDnGrd ActTx: 4  Rx: 4   En: 3,4         
   LCRC           Act: 14-bit       En: 14-bit,16-bit,48-bit       Mgmt: True 
   LID: 0x000000a7-0x000000a7       SM LID: 0x00000005 SL: 0 
         QSFP Copper,       2m  Hitachi Metals    P/N IQSFP26C-20       Rev 00
   Xmit Data:         1200547346 MB Pkts:         255264322799
   Recv Data:         1580317452 MB Pkts:         297296790887
   Link Quality: 5 (Excellent)

Additional context

Workaround: set FI_PROVIDER=psm2

"duplicated by the psm3 provider"
What version of libfabric are you using? This issue was corrected a couple of releases ago.

Yes, sorry, I conflated two issues. The issue with symbols occurs with 1.12.1 but not with 1.15.1.

Still, defaulting to opx over psm2 is surprising, so I'll edit the issue and leave that part.

I'll look into this. Can you assign this to me?

Created PR #7926

This is probably fixed/closed

@shefty another one that I think has been fixed

@timothom64 - Do you need write access for the opx provider? (and ofiwg more broadly)