Clarification needed for NCCL setting recommendation
kwen2501 opened this issue · comments
Hi Team,
Step 8 of this EFA guidance seems to recommend the following NCCL settings:
- NCCL_ALGO=ring: enables the ring algorithm for collective operations.
- NCCL_PROTO=simple: instructs NCCL to use a simple protocol for communication. Currently, the EFA provider does not support LL protocols. Enabling them could lead to data corruption.
Are they still the recommended settings today? In this recent issue, the recommendation seems to have changed to NOT changing any NCCL defaults: #65 (comment)
Any clarification would be much appreciated!
Specifically, does EFA support LL or LL128 today? Or do users still need to set the protocol to Simple to avoid data corruption?
Thank you!
Hey Ke,
Thanks for reaching out.
Yes, we recommend using NCCL_ALGO=ring for large-message all-reduce, as it yields better performance. NCCL_PROTO=simple is a must to ensure data consistency when using EFA as the network provider. EFA doesn't support the LL and LL128 variants.
Regarding the comment in #65 (comment), the recommendation there was to NOT change the default for NCCL_MIN_NCHANNELS. All of the recommendations in our getting started document still apply.
Hope that clarifies.
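For readers who launch jobs from Python, the settings discussed above could be applied roughly as follows. This is only a minimal sketch: the helper name `apply_efa_nccl_settings` is hypothetical and not part of NCCL or the EFA tooling, and it assumes the variables are set before the process initializes NCCL. NCCL_MIN_NCHANNELS is deliberately left untouched, per the reply above.

```python
import os

# Values recommended in this thread for EFA. The variable names are NCCL
# environment variables; the dict and helper below are illustrative only.
EFA_NCCL_SETTINGS = {
    "NCCL_ALGO": "ring",     # ring algorithm for collective operations
    "NCCL_PROTO": "simple",  # EFA does not support LL / LL128
}


def apply_efa_nccl_settings(env=None):
    """Hypothetical helper: write the recommended settings into `env`.

    Defaults to os.environ. Any other variable already present,
    including NCCL_MIN_NCHANNELS, is left at its existing value.
    """
    if env is None:
        env = os.environ
    for name, value in EFA_NCCL_SETTINGS.items():
        env[name] = value
    return env
```

Calling `apply_efa_nccl_settings()` at the top of a training script, before any distributed initialization, would be one way to use it; exporting the same two variables in the job's shell environment is equivalent.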
Thanks for the clarification @rashikakheria. It really helps!