ofiwg / libfabric

Open Fabric Interfaces

Home Page: http://libfabric.org/

prov/net: RMA operations can hang when followed immediately by a barrier

ooststep opened this issue

A simple scenario to describe the issue is a 2 rank RMA read test.

  • rank 1 requests an RMA read from rank 0, followed by a barrier
  • rank 0 goes directly to the barrier

In this case, rank 1 expects the RMA read completion acknowledgement, but rank 0's barrier message is at the front of the TCP stream. Because the stream is processed in order, the pending barrier message blocks further RMA progress and stalls communication.

There is an initial fix (#8283, #8272 for v1.17) which enables a single 0-byte tagged message to be 'saved' and skipped so that further processing can continue. This allows the described scenario to succeed.
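The effect of the single 'saved' message can be illustrated with a toy model of an in-order stream. This is only a sketch, not the provider's code: `msg_kind`, `rma_read_completes`, and the `save_slots` parameter are hypothetical names invented for illustration, and it assumes the barrier reaches rank 1 as a 0-byte tagged message, as the fix suggests.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message kinds for this toy model; not libfabric types. */
enum msg_kind {
    MSG_TAGGED_ZERO_BYTE, /* e.g. a barrier message from the peer */
    MSG_RMA_READ_RESP     /* the response the receiver is waiting for */
};

/*
 * Walk an in-order stream while the receiver waits for an RMA read
 * response. Up to save_slots unexpected 0-byte tagged messages may be
 * set aside and skipped. Returns true if the RMA response is reached,
 * false if the head of the stream blocks progress.
 */
static bool rma_read_completes(const enum msg_kind *stream, size_t len,
                               size_t save_slots)
{
    size_t saved = 0;

    for (size_t i = 0; i < len; i++) {
        if (stream[i] == MSG_RMA_READ_RESP)
            return true;      /* RMA progress resumes */
        if (saved == save_slots)
            return false;     /* nowhere to park this message: stall */
        saved++;              /* park the 0-byte tagged message, move on */
    }
    return false;             /* response never arrived */
}
```

With `save_slots` set to 0 the stream `{barrier, response}` stalls, which is the reported hang. With one slot it completes, but a stream with two 0-byte tagged messages ahead of the response still stalls.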

To allow a larger set of arbitrary out-of-order messages to succeed, more in-depth and likely performance-hampering changes will be required.

Fixed in main, v1.17.x, and net installation branch.