aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as a network provider while running NCCL applications.

aws-ofi-nccl makes unnecessary calls to ofi_iflush() when using the PSM3 transport.

mwheinz opened this issue

On some hardware, even a simple TensorFlow test can end up calling ofi_iflush() tens of thousands of times per rank. This serves no benefit, since PSM3 ensures that GPU buffers are kept in sync after each I/O. In addition, because ofi_iflush() calls ofi_nccl_gdr_flush_disable() on every invocation, and ofi_nccl_gdr_flush_disable() acquires a mutex each time, this adds further drag on performance.
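To illustrate the mutex overhead described above, here is a minimal sketch of one possible mitigation: reading the flush-disable setting once and caching it, so the hot path no longer takes the lock. This is not the plugin's actual code. The names `slow_flush_disabled`, `flush_disabled`, `example_iflush`, and the environment variable `EXAMPLE_GDR_FLUSH_DISABLE` are all hypothetical stand-ins, and the sketch assumes the setting does not change after startup.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a parameter getter that locks on every call. */
static pthread_mutex_t param_lock = PTHREAD_MUTEX_INITIALIZER;
static int lock_acquisitions = 0;   /* counts how often the mutex is taken */

static bool slow_flush_disabled(void)
{
    pthread_mutex_lock(&param_lock);
    lock_acquisitions++;
    /* Hypothetical variable name, used only for this example. */
    const char *v = getenv("EXAMPLE_GDR_FLUSH_DISABLE");
    bool disabled = (v != NULL && v[0] == '1');
    pthread_mutex_unlock(&param_lock);
    return disabled;
}

/* Cache the value on first use; pthread_once makes this thread-safe. */
static pthread_once_t flush_once = PTHREAD_ONCE_INIT;
static bool flush_disabled_cached = false;

static void init_flush_disabled(void)
{
    flush_disabled_cached = slow_flush_disabled();
}

static bool flush_disabled(void)
{
    pthread_once(&flush_once, init_flush_disabled);
    return flush_disabled_cached;
}

/* Hypothetical flush entry point: returns 1 if a flush would be issued. */
static int example_iflush(void)
{
    if (flush_disabled())
        return 0;   /* skip the flush entirely */
    /* ... a real implementation would issue the flush here ... */
    return 1;
}
```

With this arrangement, calling `example_iflush()` tens of thousands of times takes the parameter mutex only once. Whether that saving is measurable in practice is a separate question that this sketch does not answer.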

Could you please rebase your patch on current master and verify it on your end before we do another round of review?

I'm dropping this for now. We have no evidence that it's causing an actual problem and I've been redirected to other tasks.