prov/opx: used by default instead of psm2 even though it's "beta"
bartoldeman opened this issue · comments
Describe the bug
The opx provider is the default even though it's labelled BETA; psm2 is only used if you disable opx or set FI_PROVIDER=psm2.
If opx is enabled, it takes priority over psm2 even though it's labelled BETA. From src/fabric.c:
    char *ordered_prov_names[] = {
        "efa", "opx", "psm2", "psm", "usnic", "gni", "bgq", "verbs",
Shouldn't this be
    "efa", "psm2", "opx", "psm", "usnic", "gni", "bgq", "verbs",
instead?
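To make the effect of the ordering concrete, here is a minimal sketch of how a priority list of this kind decides between two providers that are both available. It is not libfabric's actual selection code: `prov_rank`, `pick_provider`, `current_order`, and `proposed_order` are hypothetical names, and the arrays contain only the entries quoted in this issue, not the full list from src/fabric.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Only the entries quoted in this issue; earlier entries have higher priority. */
const char *const current_order[] = {
    "efa", "opx", "psm2", "psm", "usnic", "gni", "bgq", "verbs",
};

/* The order proposed in this issue: psm2 ahead of opx. */
const char *const proposed_order[] = {
    "efa", "psm2", "opx", "psm", "usnic", "gni", "bgq", "verbs",
};

/* Hypothetical helper: position of `name` in `order`, or n_order if absent. */
size_t prov_rank(const char *const *order, size_t n_order, const char *name)
{
    assert(order != NULL && name != NULL);
    for (size_t i = 0; i < n_order; i++) {
        if (strcmp(order[i], name) == 0)
            return i;
    }
    return n_order;
}

/* Hypothetical helper: the available provider with the best (lowest) rank,
 * or NULL if none of the available providers appears in the list. */
const char *pick_provider(const char *const *order, size_t n_order,
                          const char *const *avail, size_t n_avail)
{
    const char *best = NULL;
    size_t best_rank = n_order;

    for (size_t i = 0; i < n_avail; i++) {
        size_t rank = prov_rank(order, n_order, avail[i]);
        if (rank < best_rank) {
            best_rank = rank;
            best = avail[i];
        }
    }
    return best;
}
```

With both "opx" and "psm2" in the available set, `pick_provider` returns "opx" under `current_order` and "psm2" under `proposed_order`, which is the behaviour change requested here.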
Secondly, if you force psm2 via FI_PROVIDER=psm2, all symbols dynamically linked from libpsm2.so (e.g. psm2_mq_irecv2) are duplicated by the psm3 provider inside libfabric.so, so they are not taken from libpsm2.so. As a consequence, all communication goes over Ethernet instead of omnipath.
This was fixed in libfabric 1.15.0 as far as I can see, commit 3f1d52d.
To Reproduce
Steps to reproduce the behavior:
- Without FI_PROVIDER set, run a test with FI_LOG_LEVEL=trace, and you see opx is used on omnipath.
Expected behavior
- Without FI_PROVIDER set, run a test with FI_LOG_LEVEL=trace, and you see psm2 is used on omnipath.
Environment:
$ opainfo
hfi1_0:1 PortGID:0xfe80000000000000:00117501017afb0e
PortState: Active
LinkSpeed Act: 25Gb En: 25Gb
LinkWidth Act: 4 En: 4
LinkWidthDnGrd ActTx: 4 Rx: 4 En: 3,4
LCRC Act: 14-bit En: 14-bit,16-bit,48-bit Mgmt: True
LID: 0x000000a7-0x000000a7 SM LID: 0x00000005 SL: 0
QSFP Copper, 2m Hitachi Metals P/N IQSFP26C-20 Rev 00
Xmit Data: 1200547346 MB Pkts: 255264322799
Recv Data: 1580317452 MB Pkts: 297296790887
Link Quality: 5 (Excellent)
Additional context
Workaround: set FI_PROVIDER=psm2
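If exporting the variable in the shell is inconvenient, the same workaround can be applied from inside the application. The sketch below assumes libfabric reads FI_PROVIDER from the environment when it is first initialized, so the variable has to be set before the first libfabric call; `force_provider` and `provider_is_forced` are hypothetical helper names, not part of libfabric.

```c
#define _POSIX_C_SOURCE 200112L /* for setenv() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: set FI_PROVIDER for this process, overriding any
 * existing value. Returns 0 on success, -1 on failure. */
int force_provider(const char *name)
{
    assert(name != NULL);
    if (setenv("FI_PROVIDER", name, 1) != 0) {
        perror("setenv");
        return -1;
    }
    return 0;
}

/* Hypothetical helper: non-zero if FI_PROVIDER is currently exactly `name`. */
int provider_is_forced(const char *name)
{
    const char *cur = getenv("FI_PROVIDER");

    assert(name != NULL);
    return cur != NULL && strcmp(cur, name) == 0;
}
```

A program would call `force_provider("psm2")` early in `main`, before its first call into libfabric such as `fi_getinfo`.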
duplicated by the psm3 provider
What version of libfabric are you using? This issue was corrected a couple of releases ago.
Yes, sorry, I conflated two issues. The issue with symbols occurs with 1.12.1 but not with 1.15.1.
Still, defaulting to opx over psm2 is surprising, so I'll edit and leave that.
I'll look into this. Can you assign this to me?
Created PR
This is probably fixed/closed
@shefty another one that I think has been fixed
@timothom64 - Do you need write access for the opx provider? (and ofiwg more broadly)