prov/net: RMA operations can hang when followed immediately by a barrier
ooststep opened this issue
A simple scenario that reproduces the issue is a 2-rank RMA read test:
- rank 1 requests an RMA read from rank 0, followed by a barrier
- rank 0 goes directly to the barrier
In this case, rank 1 expects the RMA read completion acknowledgement, but rank 0's barrier message sits at the front of the TCP stream. Rank 1 cannot consume the barrier message yet, so it blocks further RMA progress and communication stalls.
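The stall can be illustrated with a small toy model of a strictly in-order receiver. This is only a sketch of the head-of-line blocking described above, not the provider's code: `msg_kind`, `rx_state`, and `strict_progress` are hypothetical names invented for the illustration, and the barrier is assumed to arrive as a 0-byte tagged message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message kinds for this toy model; not libfabric types. */
enum msg_kind { MSG_RMA_ACK, MSG_TAGGED_ZERO };

/* What the receiver is prepared to consume next. */
enum rx_state { RX_WANT_RMA_ACK, RX_WANT_ANY };

/*
 * Strict in-order receiver: consume messages from the head of the stream
 * only while the head is something the receiver can handle right now.
 * Returns how many messages were consumed before progress stopped.
 */
static size_t strict_progress(const enum msg_kind *stream, size_t len,
                              enum rx_state state)
{
    size_t consumed = 0;

    assert(stream != NULL || len == 0);
    while (consumed < len) {
        /* Waiting for the RMA ack, but something else is at the head:
         * nothing behind it in the stream is reachable. */
        if (state == RX_WANT_RMA_ACK && stream[consumed] != MSG_RMA_ACK)
            break;
        if (stream[consumed] == MSG_RMA_ACK)
            state = RX_WANT_ANY;
        consumed++;
    }
    return consumed;
}
```

With the stream `{MSG_TAGGED_ZERO, MSG_RMA_ACK}` and a receiver in `RX_WANT_RMA_ACK`, the function consumes nothing, which corresponds to the hang seen by rank 1. If the acknowledgement arrives first, both messages are consumed.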
There is an initial fix (#8283, #8272 for v1.17) which allows a single 0-byte tagged message to be 'saved' and skipped so that processing of the stream can continue. This allows the described scenario to succeed.
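The idea of saving one 0-byte tagged message can be sketched as a toy model. This is an assumption-laden illustration rather than the code from the referenced PRs: `rx_ctx`, `msg_kind`, and `progress_with_save` are hypothetical names, and the single `saved_zero` flag stands in for whatever state the provider actually keeps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message kinds for this toy model; not libfabric types. */
enum msg_kind { MSG_RMA_ACK, MSG_TAGGED_ZERO, MSG_TAGGED_DATA };

/* Hypothetical receive-side state. */
struct rx_ctx {
    bool want_rma_ack; /* an RMA read is outstanding */
    bool saved_zero;   /* one 0-byte tagged message has been set aside */
};

/*
 * Consume messages in stream order. While an RMA ack is outstanding,
 * exactly one 0-byte tagged message may be set aside and skipped; any
 * other unexpected message still blocks progress.
 * Returns how many messages were taken off the stream.
 */
static size_t progress_with_save(struct rx_ctx *rx,
                                 const enum msg_kind *stream, size_t len)
{
    size_t consumed = 0;

    assert(rx != NULL);
    while (consumed < len) {
        enum msg_kind m = stream[consumed];

        if (rx->want_rma_ack && m != MSG_RMA_ACK) {
            if (m == MSG_TAGGED_ZERO && !rx->saved_zero) {
                rx->saved_zero = true; /* remember it, keep reading */
                consumed++;
                continue;
            }
            break; /* a second or non-empty message: still blocked */
        }
        if (m == MSG_RMA_ACK)
            rx->want_rma_ack = false;
        consumed++;
    }
    return consumed;
}
```

In this model the stream `{MSG_TAGGED_ZERO, MSG_RMA_ACK}` is fully consumed and `saved_zero` is left set for the later barrier receive to match. Two unexpected 0-byte messages in a row, or a tagged message carrying data, still stall the receiver, which is the limitation of a single saved slot.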
To support a broader set of arbitrary out-of-order message patterns, more in-depth and likely performance-hampering changes will be required.
Fixed in main, v1.17.x, and net installation branch.