aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as a network provider while running NCCL applications.

Handling GDR-capable providers not requested for memory registration

dmaryin opened this issue

Hello aws_ofi_nccl maintainers,

There is an issue in the current implementation for GDR-capable providers that do not request memory registration (i.e. they provide FI_HMEM but not FI_MR_HMEM or FI_MR_LOCAL).

The function register_mr_buffers() leaves mr_handle untouched, and since the caller initializes it to NULL, mr_handle remains NULL.
Later, in ofi_iflush(), a NULL mr_handle is treated as an error condition, which leads to returning ncclSystemError. But a NULL handle here is normal if the provider did not ask for memory registration.
Later still, in ofi_closeRecv(), mr_handle is passed to fi_close(); because it is NULL in this case, this leads to a segfault.
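
To illustrate the two failure points, here is a minimal, self-contained sketch of the NULL-tolerant handling. It is not the code from the pull request. The names `mock_mr`, `mock_close`, `sketch_flush`, and `sketch_close_recv` are hypothetical stand-ins for the libfabric handle type, `fi_close()`, and the plugin's flush and close paths, so the example compiles without libfabric.

```c
#include <stddef.h>

/* Hypothetical stand-in for a libfabric memory-registration handle. */
struct mock_mr {
    int closed;
};

enum { SKETCH_OK = 0, SKETCH_ERR = -1 };

/* Hypothetical stand-in for fi_close(): it dereferences its argument,
 * so passing NULL would crash, as described in the report. */
int mock_close(struct mock_mr *mr)
{
    mr->closed = 1;
    return 0;
}

/* Flush path: a NULL handle means the provider never asked for
 * registration, so it is not an error. The descriptor is used only
 * when a registration actually exists. */
int sketch_flush(struct mock_mr *mr_handle, int *used_desc)
{
    *used_desc = (mr_handle != NULL);
    return SKETCH_OK;
}

/* Close path: skip the close call when nothing was registered. */
int sketch_close_recv(struct mock_mr *mr_handle)
{
    if (mr_handle == NULL)
        return SKETCH_OK;
    return mock_close(mr_handle) == 0 ? SKETCH_OK : SKETCH_ERR;
}
```

In both paths a NULL handle is read as "no registration was requested" rather than as a failure, which matches the behavior asked for above.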

To fix this, I created pull request #81.
Please consider merging it.

BRs,
Denis

Thank you for contributing! Could you let us know which provider this PR was tested against?

Sure. We tested against the PSM3 provider.

Merged the changes. Resolving issue.