aws / aws-ofi-nccl

This is a plugin that lets EC2 developers use libfabric as a network provider while running NCCL applications.

GPU direct

tks2004 opened this issue

To enable GPU direct, is there any FI_* environment variable that needs to be set?

Applications request GPU direct capability from Libfabric by adding the FI_HMEM flag when calling fi_getinfo, as the plugin does here:

hints->caps = FI_TAGGED | FI_MSG | FI_HMEM;

Before Libfabric 1.18, the Libfabric EFA provider also required an environment variable, FI_EFA_USE_DEVICE_RDMA=1, to enable GPU direct. With Libfabric 1.18+ and aws-ofi-nccl 1.7.0+, this is no longer required. See also https://github.com/aws/aws-ofi-nccl/blob/master/doc/efa-env-var.md, which is mostly relevant to the EFA provider.
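The version rule above can be sketched as a small check, for example in a launcher that decides whether to export the variable. This is only an illustration: `device_rdma_env_required` is a hypothetical helper, not part of Libfabric or aws-ofi-nccl, and it takes the conservative reading that the variable can be dropped only when both version conditions hold.

```c
/* Hypothetical helper, not a Libfabric or aws-ofi-nccl API.
 * Returns 1 if FI_EFA_USE_DEVICE_RDMA=1 should still be set explicitly,
 * 0 if it can be omitted.
 * Assumption (conservative reading of the thread): the variable is
 * unnecessary only when BOTH Libfabric >= 1.18 and aws-ofi-nccl >= 1.7. */
int device_rdma_env_required(int lf_major, int lf_minor,
                             int plugin_major, int plugin_minor)
{
    /* Libfabric 1.18 or newer */
    int lf_new_enough = (lf_major > 1) ||
                        (lf_major == 1 && lf_minor >= 18);

    /* aws-ofi-nccl 1.7.x or newer */
    int plugin_new_enough = (plugin_major > 1) ||
                            (plugin_major == 1 && plugin_minor >= 7);

    return !(lf_new_enough && plugin_new_enough);
}
```

If the check returns 1, FI_EFA_USE_DEVICE_RDMA=1 would typically be exported in the job environment before the application starts, so that it is already visible when the EFA provider initializes.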