ofiwg / libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/

prov/shm: shm provider claims FI_HMEM cap when there are no hmem ifaces initialized

shijin-aws opened this issue

Currently, the shm provider always claims the FI_HMEM cap https://github.com/ofiwg/libfabric/blob/main/prov/shm/src/smr_attr.c#L161, even when no hmem ifaces are initialized on the system.

Is this by design? It exposed an issue when running Open MPI 5 with Libfabric 1.20.x, which includes PR #9404 adding FI_ATOMIC support for FI_HMEM in the shm provider.

Open MPI 5's btl/ofi component has special logic https://github.com/open-mpi/ompi/blob/bd33b994e1d09ab71e9bb0b66c9661623fe742a3/opal/mca/btl/ofi/btl_ofi_component.c#L356-L385 to first call fi_getinfo with FI_HMEM alongside FI_RMA & FI_ATOMICS. If the call with FI_HMEM fails, it retries fi_getinfo without FI_HMEM. By contrast, the EFA provider won't claim FI_HMEM support unless at least one hmem iface is initialized.

This results in an issue where, in the first fi_getinfo call with FI_HMEM, only the shm provider succeeds, but shm is excluded by Open MPI by default: https://github.com/open-mpi/ompi/blob/bd33b994e1d09ab71e9bb0b66c9661623fe742a3/opal/mca/common/ofi/common_ofi.c#L44. Open MPI then never moves on to the non-FI_HMEM fi_getinfo call and closes the whole ofi component.
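For readers following along, below is a minimal sketch of the two-pass fi_getinfo fallback described above. It is not OMPI's actual btl/ofi code; the helper name, the mr_mode bits, and the API version are illustrative assumptions against the public libfabric API.

```c
/* Minimal sketch of the FI_HMEM-first fi_getinfo fallback described above.
 * Not OMPI's actual btl/ofi code; mr_mode bits and version are illustrative. */
#include <stddef.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

static struct fi_info *getinfo_with_hmem_fallback(void)
{
	struct fi_info *hints, *info = NULL;
	int ret;

	hints = fi_allocinfo();
	if (!hints)
		return NULL;

	/* First pass: request device-memory support on top of RMA/atomics. */
	hints->caps = FI_RMA | FI_ATOMIC | FI_HMEM;
	hints->domain_attr->mr_mode = FI_MR_HMEM | FI_MR_ALLOCATED |
				      FI_MR_VIRT_ADDR | FI_MR_PROV_KEY;

	ret = fi_getinfo(FI_VERSION(1, 20), NULL, NULL, 0, hints, &info);
	if (ret) {
		/* Second pass: retry without FI_HMEM. */
		hints->caps &= ~FI_HMEM;
		hints->domain_attr->mr_mode &= ~FI_MR_HMEM;
		ret = fi_getinfo(FI_VERSION(1, 20), NULL, NULL, 0, hints, &info);
	}

	fi_freeinfo(hints);
	return ret ? NULL : info;
}
```

The failure mode described above is that the first pass succeeds (shm alone matches), so the second pass never runs, yet the one provider returned is on OMPI's exclude list.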

To me, Open MPI 5 needs to do a better job of handling the FI_HMEM/non-FI_HMEM error path, but I also don't think it makes sense for the shm provider to claim FI_HMEM unconditionally.
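For what it's worth, a conditional claim could look roughly like the sketch below. This is not the actual smr code; have_hmem_iface() is a hypothetical stand-in for whatever runtime check the provider would use to know that at least one device-memory interface (CUDA, ROCr, ZE, ...) initialized.

```c
/* Hedged sketch only -- not the actual shm provider code. */
#include <stdbool.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Hypothetical stub: a real provider would check its hmem runtime state
 * (CUDA, ROCr, ZE, ...) here instead of hard-coding a value. */
static bool have_hmem_iface(void)
{
	return false;
}

/* Strip FI_HMEM (and the matching mr_mode bit) from an fi_info the
 * provider is about to return when no hmem iface is usable. */
static void drop_hmem_if_unavailable(struct fi_info *info)
{
	if (have_hmem_iface())
		return;

	info->caps &= ~FI_HMEM;
	if (info->domain_attr)
		info->domain_attr->mr_mode &= ~FI_MR_HMEM;
}
```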

@aingerson what do you think?

@shijin-aws Hmm that's a tricky one. I definitely think OMPI needs to do a better job of querying only for the things it needs. Even if we change the behavior, OMPI runs the risk of hurting performance by enabling things it doesn't need. For example, let's say HMEM is available on the system and it uses the shm provider with HMEM even though it doesn't need it. This makes shm disable CMA (which we plan on fixing so that's temporary but still an issue) as well as start a listener thread for the ZE IPC implementation. There could be other cases for the same thing. Things that are requested by the app could result in performance penalties (there are others like FI_SOURCE for example). I think it is unwise for OMPI to always try to get HMEM support.
While I think there is an argument for returning FI_ENODATA if no iface is available, my inclination is to say that it should still return the hmem fi_info because FI_HMEM just means the provider can handle hmem if the user uses it. If no iface is available on the system, then the user won't ever be able to pass hmem into the provider at all. The issue is not that the provider cannot support what the application gives it, but rather that the application won't be able to do that in the first place.
The only thing I can think of is the case where OMPI can use an hmem iface and OFI fails initialization or something. But that seems like a bigger failure altogether.

Short answer: I'm torn with regard to the proper behavior for shm, but I feel like OMPI shouldn't be passing FI_HMEM in when it doesn't need it.

For example, let's say HMEM is available on the system and it uses the shm provider with HMEM even though it doesn't need it. This makes shm disable CMA (which we plan on fixing so that's temporary but still an issue) as well as start a listener thread for the ZE IPC implementation. There could be other cases for the same thing. Things that are requested by the app could result in performance penalties (there are others like FI_SOURCE for example). I think it is unwise for OMPI to always try to get HMEM support.

I agree with you on this. On AWS's p4/p5 series instances (with NVIDIA GPUs), if an application runs with Open MPI 5 it will use FI_HMEM and hurt host-to-host traffic, even if the application doesn't need any GPU support from OMPI. Maybe not many people are going to do that in the cloud (use GPU instances to run purely host workloads).

my inclination is to say that it should still return the hmem fi_info because FI_HMEM just means the provider can handle hmem if the user uses it.

I didn't see the man page give a clear description of this. FI_HMEM: Specifies that the endpoint should support transfers to and from device memory.
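For context, "the user passing hmem into the provider" in practice means handing device memory to libfabric, e.g. by registering it with an hmem iface. Below is a hedged sketch (register_cuda_buffer is a made-up helper; it assumes a CUDA buffer and an already-opened domain), not taken from any provider or application:

```c
/* Hedged sketch: registering a CUDA device buffer so a provider that
 * advertised FI_HMEM can transfer to/from it.  register_cuda_buffer is a
 * made-up helper name; fields left out of fi_mr_attr default to zero. */
#include <sys/uio.h>
#include <rdma/fi_domain.h>

static int register_cuda_buffer(struct fid_domain *domain, void *gpu_ptr,
				size_t len, int cuda_device, struct fid_mr **mr)
{
	struct iovec iov = {
		.iov_base = gpu_ptr,
		.iov_len = len,
	};
	struct fi_mr_attr attr = {
		.mr_iov = &iov,
		.iov_count = 1,
		.access = FI_SEND | FI_RECV | FI_READ | FI_WRITE,
		.iface = FI_HMEM_CUDA,		/* this buffer is device memory */
		.device.cuda = cuda_device,
	};

	return fi_mr_regattr(domain, &attr, 0, mr);
}
```

If no hmem iface ever comes up on the node, an application has no device pointers to register in the first place, which is the argument above for still advertising FI_HMEM.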

Do other providers like tcp and verbs claim FI_HMEM support when there are no hmem ifaces on the system? @j-xiong @shefty

tcp doesn't support FI_HMEM but verbs does and behaves the same as shm (will return FI_HMEM support even if no interfaces are available)

verbs hmem support doesn't depend on the initialization of the HMEM interfaces. It is on as long as the kernel RDMA subsystem supports device memory (via dmabuf or MOFED peer-mem).

@j-xiong verbs doesn't support copy (non-rdma) mode for FI_HMEM?

@shijin-aws No. The verbs provider tries to get as close to IB verbs as possible and doesn't introduce extra protocols. The bounce buffer support is provided by rxm via the eager and SAR protocols. The rendezvous protocol is not buffered.

@j-xiong @aingerson Thanks both. I will close this issue as it's expected behavior. We will evaluate whether the efa provider needs to follow this.