aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as a network provider while running NCCL applications.


Clarification needed for NCCL setting recommendation

kwen2501 opened this issue

Hi Team,

Step 8 of this EFA guidance seems to recommend the following NCCL settings:

  • NCCL_ALGO=ring — enables ring algorithm for collective operations.

  • NCCL_PROTO=simple — instructs NCCL to use a simple protocol for communication. Currently, the EFA provider does not support LL protocols. Enabling them could lead to data corruption.
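For readers who launch training from Python, the two settings above could be applied roughly as follows. This is a minimal sketch, not something taken from the EFA guidance: the helper name `efa_nccl_env` and the constant `EFA_NCCL_SETTINGS` are hypothetical, and the values are copied verbatim from the bullets above.

```python
import os

# Values copied verbatim from the two bullets quoted in this issue.
EFA_NCCL_SETTINGS = {
    "NCCL_ALGO": "ring",
    "NCCL_PROTO": "simple",
}


def efa_nccl_env(base_env=None):
    """Return a copy of base_env (default: os.environ) with the settings applied.

    Hypothetical helper for illustration; it is not part of NCCL or aws-ofi-nccl.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(EFA_NCCL_SETTINGS)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the training job, so the variables are set before NCCL initializes in the child process.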

Are they still the recommended settings today? In this recent issue, the recommendation seems to have changed to NOT changing any NCCL defaults: #65 (comment)

Any clarification would be much appreciated!

Specifically, does EFA support LL or LL128 today? Or do users still need to set the protocol to Simple to avoid data corruption?

Thank you!

Cc @rohan-varma @zhaojuanmao @pbelevich

Hey Ke,

Thanks for reaching out.

Yes, we recommend using NCCL_ALGO=ring for large-message all-reduce, as it yields better performance. NCCL_PROTO=simple is a must to ensure data consistency when using EFA as a network provider. EFA doesn't support the LL and LL128 variants.

Regarding the comment in #65 (comment), the recommendation was to NOT change the default for NCCL_MIN_NCHANNELS. All of the recommendations from our getting started document still apply.
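As an illustration of the points above, a job launcher could sanity-check its environment before starting NCCL. This is only a sketch under assumptions: `check_nccl_env` is a hypothetical helper, and it is deliberately strict in that it accepts only the single value `simple` for NCCL_PROTO (compared case-insensitively).

```python
def check_nccl_env(env):
    """Return a list of warnings for an environment mapping.

    Hypothetical check based on the advice in this thread:
    NCCL_PROTO must be "simple" on EFA, and NCCL_MIN_NCHANNELS
    should be left at its default (that is, not set at all).
    """
    problems = []

    # Strict on purpose: anything other than exactly "simple" is flagged,
    # including an unset variable.
    proto = env.get("NCCL_PROTO", "")
    if proto.lower() != "simple":
        problems.append("NCCL_PROTO should be 'simple' when using EFA")

    # The advice was to not override the default for this variable.
    if "NCCL_MIN_NCHANNELS" in env:
        problems.append("NCCL_MIN_NCHANNELS should be left at its default")

    return problems
```

Calling `check_nccl_env(os.environ)` (after `import os`) before launching returns an empty list when the environment matches the advice above.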

Hope that clarifies.

Thanks for the clarification @rashikakheria. It really helps!