In-Place Reduction for NCCL
cat-state opened this issue · comments
NCCL supports all-reduce in place, however Comm::all_reduce takes in a &CudaSlice to read from and a &mut CudaSlice to write into, which doesn't allow in-place reduction.
Ah yeah, I see that (NCCL docs: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/colls.html#c.ncclReduce)
I think in this case, due to Rust's borrow rules, it'd probably be easiest to just add Comm::all_reduce_in_place that takes a &mut CudaSlice. Fairly easy add if anyone wants to contribute a PR for this! Otherwise I can add it later this week.
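For anyone picking this up, here is a rough CPU-only sketch of why the two-argument shape rules out in-place use and what a single-buffer method could look like. Buffer and MockComm are hypothetical stand-ins made up for illustration, not the crate's CudaSlice or Comm, and the "reduction" is just a sum over in-memory peer vectors rather than a real NCCL call.

```rust
/// Hypothetical stand-in for a device buffer (NOT the crate's CudaSlice).
pub struct Buffer {
    pub data: Vec<f32>,
}

/// Hypothetical stand-in for a communicator (NOT the crate's Comm).
/// The other ranks' contributions are held in memory so this runs on a CPU.
pub struct MockComm {
    pub peers: Vec<Vec<f32>>,
}

impl MockComm {
    /// Sum of every peer's element at index `i`.
    fn peer_sum(&self, i: usize) -> f32 {
        self.peers.iter().map(|p| p[i]).sum::<f32>()
    }

    /// Out-of-place shape, as described in the issue: one shared borrow to
    /// read from, one exclusive borrow to write into.
    pub fn all_reduce(&self, send: &Buffer, recv: &mut Buffer) {
        for (i, (s, r)) in send.data.iter().zip(recv.data.iter_mut()).enumerate() {
            *r = *s + self.peer_sum(i);
        }
    }

    /// Assumed in-place shape: a single exclusive borrow that is both the
    /// input and the output of the reduction.
    pub fn all_reduce_in_place(&self, buf: &mut Buffer) {
        for (i, v) in buf.data.iter_mut().enumerate() {
            *v += self.peer_sum(i);
        }
    }
}

/// Runs both variants on the same input and returns (out_of_place, in_place).
pub fn demo() -> (Vec<f32>, Vec<f32>) {
    let comm = MockComm { peers: vec![vec![1.0, 2.0]] };

    // Out-of-place needs a second buffer for the result.
    let send = Buffer { data: vec![10.0, 20.0] };
    let mut recv = Buffer { data: vec![0.0, 0.0] };
    comm.all_reduce(&send, &mut recv);

    // Passing the same buffer twice is rejected by the borrow checker:
    // comm.all_reduce(&buf, &mut buf);
    // error[E0502]: cannot borrow `buf` as mutable because it is also
    // borrowed as immutable

    // In-place needs only one buffer.
    let mut buf = Buffer { data: vec![10.0, 20.0] };
    comm.all_reduce_in_place(&mut buf);

    (recv.data, buf.data)
}
```

On the NCCL side, an in-place all-reduce is just the case where the send and receive pointers are the same, so a real implementation would presumably pass the one buffer's device pointer for both arguments.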