How does ofi_iflush() work?

mwheinz opened this issue 2 years ago

Hi. I'm a code monkey on the PSM3 team and I'm investigating an internal problem report and trying to understand if it's a real issue or not. It relates to ofi_iflush() issuing an RDMA read to, presumably, somehow force all outstanding I/Os to complete. It seems pretty obvious that the data being read is not, itself, important, and I'm wondering how adding another I/O to the queue guarantees that all outstanding I/Os complete?

Could you shed some light on this for me, please? Does NCCL assume that when the RDMA read completes all prior I/Os have also completed?

Rashika Kheria · Answer 1 · Tue Aug 16 2022 01:32:09 GMT+0800 (China Standard Time)

Hello,

GPUDirect RDMA enables third-party PCIe devices such as RDMA NICs (EFA or PSM3) to send directly from and receive directly into GPU buffers. That means that on the receive side, NCCL's host proxy thread can post receive buffers and get notified when network transfers complete. However, since the GPU follows a relaxed memory model, a network completion from the NIC does not guarantee that the data has been received by the GPU (i.e. that writes initiated by the NIC are visible to all running kernels).

A read from GPU memory (as done by ofi_iflush) forces these writes to be committed, following PCIe read/write ordering semantics. This ensures that it is now safe for the NCCL host proxy thread to trigger running kernels to start computation on the received data.
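To make the ordering argument concrete, here is a toy model in plain C. It is only a sketch of the idea, not libfabric, PSM3, or real PCIe code, and every name in it (toy_link, nic_post_write, gpu_peek, nic_flush_read) is hypothetical. It assumes the simplified rule that a read travelling the same path as earlier posted writes cannot return until those writes have been committed.

```c
#include <assert.h>
#include <stddef.h>

#define MEM_WORDS   8
#define MAX_PENDING 16

/* Hypothetical toy model: a NIC write that has been issued but not yet committed. */
struct posted_write { size_t addr; int value; };

struct toy_link {
    int gpu_mem[MEM_WORDS];                   /* what a running kernel would observe */
    struct posted_write pending[MAX_PENDING]; /* writes still in flight on the path */
    size_t n_pending;
};

/* NIC issues a posted write. The network completion can be reported at this
 * point even though gpu_mem has not changed yet.
 * Returns 0 on success, -1 on a bad address or a full queue. */
int nic_post_write(struct toy_link *l, size_t addr, int value)
{
    if (addr >= MEM_WORDS || l->n_pending == MAX_PENDING)
        return -1;
    l->pending[l->n_pending++] = (struct posted_write){ addr, value };
    return 0;
}

/* What a GPU kernel sees: only writes that have been committed. */
int gpu_peek(const struct toy_link *l, size_t addr)
{
    assert(addr < MEM_WORDS);
    return l->gpu_mem[addr];
}

/* Flush read: a read on the same path may not pass earlier posted writes,
 * so all of them are committed (in issue order) before the read returns.
 * The value read is irrelevant; the ordering side effect is the point. */
int nic_flush_read(struct toy_link *l, size_t addr)
{
    assert(addr < MEM_WORDS);
    for (size_t i = 0; i < l->n_pending; i++)
        l->gpu_mem[l->pending[i].addr] = l->pending[i].value;
    l->n_pending = 0;
    return l->gpu_mem[addr];
}
```

In this model, a value written with nic_post_write() is not visible through gpu_peek() until some nic_flush_read() has returned, which mirrors why the data returned by the flush read itself does not matter.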

Michael Heinz · Answer 2 · Tue Aug 16 2022 01:34:53 GMT+0800 (China Standard Time)

But calling fi_read() isn't reading from the GPU? It's doing an RDMA read across PSM3 and the network?

Rashika Kheria · Answer 3 · Tue Aug 16 2022 01:43:02 GMT+0800 (China Standard Time)

Yes, this is to ensure that we follow the same PCIe route as the writes from NIC to GPU. The NIC can identify such reads to be local reads and avoid using the network?

Michael Heinz · Answer 4 · Tue Aug 16 2022 01:53:04 GMT+0800 (China Standard Time)

That makes sense, but since the plugin doesn't allocate a memory region when using PSM3 (because it doesn't need one), the fi_read() call is silently failing: PSM3 ultimately returns EINVAL, so the fi_read() has no effect. I'm trying to understand if this is a problem. Applications appear to run successfully, but on some machines, if NCCL_DEBUG=info is set, you will see a series of warnings about failed fi_read() calls.

James Dinan · Answer 5 · Tue Aug 16 2022 02:36:34 GMT+0800 (China Standard Time)

Yes, this is potentially a problem if PSM3 is performing GPUDirect RDMA for the reasons Rashika mentioned above. Without this read, the NCCL kernel can see inconsistent data in GPU memory. If the fi_read implementation of iflush isn't a good solution for PSM3, you could try the CUDA flush support that was added in the last release. I would recommend testing this thoroughly, as we have seen it deadlock with some networking stacks.
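As an illustration of why a failed flush should not be ignored, the following C sketch treats the flush as a step that must succeed before received data is handed to a kernel. It is not how the plugin actually selects or invokes its flush method. The names flush_fn, flush_always_einval, flush_ok and flush_before_publish are hypothetical, and the two stub backends only stand in for "an RDMA-read flush that returns EINVAL" and "some alternative flush that works".

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical flush backend: returns 0 on success, negative errno on failure. */
typedef int (*flush_fn)(void *ctx);

/* Stub standing in for a read-based flush that the provider rejects. */
int flush_always_einval(void *ctx)
{
    (void)ctx;
    return -EINVAL;
}

/* Stub standing in for a flush method that works; counts its calls if ctx is set. */
int flush_ok(void *ctx)
{
    int *calls = ctx;
    if (calls)
        (*calls)++;
    return 0;
}

/* Run the primary flush; if it fails, try the fallback (if any).
 * Returns 0 only when some flush succeeded. A non-zero result means the
 * caller must NOT tell the kernel that the received data is ready. */
int flush_before_publish(flush_fn primary, flush_fn fallback, void *ctx)
{
    int rc = primary ? primary(ctx) : -EINVAL;
    if (rc == 0)
        return 0;
    if (fallback)
        return fallback(ctx);
    return rc;
}
```

With only the failing stub, flush_before_publish(flush_always_einval, NULL, NULL) returns -EINVAL, which is the case where a kernel could observe inconsistent data if the error were only logged as a warning.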

Michael Heinz · Answer 6 · Tue Aug 16 2022 03:49:08 GMT+0800 (China Standard Time)

"you could try the CUDA flush support that was added in the last release. I would recommend testing this thoroughly, as we have seen it deadlock with some networking stacks."

Ah. That was going to be my next question because enabling cudaDeviceFlushGPUDirectRDMAWrites() does seem to resolve the issue on the RHEL systems I tested with. Thanks for clarifying.