NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

Home Page: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html

MPI Dependency for Computation-Communication Overlapping in Tensor Parallelism

zhipeng93 opened this issue

Hi,

I've noticed that you have implemented a feature that allows overlapping computation and communication in tensor-parallel operations. This is a significant enhancement that can improve the efficiency of distributed training workflows.
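
For reference, this is roughly how I am trying to enable it on my side (a minimal sketch; the exact entry points and argument names, e.g. `initialize_ub` and `ub_tp_comm_overlap`, are my best reading of the docs and may differ between TE versions):

```python
# Sketch of enabling the tensor-parallel comm/GEMM overlap in TransformerEngine.
# NOTE: names like initialize_ub, ub_tp_comm_overlap, seq_length, micro_batch_size
# are assumptions from my reading of the docs and may vary across TE versions.
import torch
import transformer_engine.pytorch as te

SEQ_LEN, MICRO_BATCH, HIDDEN, HEADS = 2048, 2, 12288, 96
TP_SIZE = 8

# Allocate the userbuffers backing the overlap. As far as I can tell, this is
# the step that bootstraps its internal communicators through MPI.
te.module.base.initialize_ub(
    shape=[SEQ_LEN * MICRO_BATCH, HIDDEN],
    tp_size=TP_SIZE,
    use_fp8=False,
    dtype=torch.bfloat16,
)

layer = te.TransformerLayer(
    hidden_size=HIDDEN,
    ffn_hidden_size=4 * HIDDEN,
    num_attention_heads=HEADS,
    seq_length=SEQ_LEN,
    micro_batch_size=MICRO_BATCH,
    ub_tp_comm_overlap=True,  # request the tensor-parallel comm/GEMM overlap
)
```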

However, when deploying jobs with torchrun on a Kubernetes (k8s) cluster, the overlapping feature does not work as expected. It appears that the current implementation depends on MPI for certain initialization steps, and an MPI environment is not available when processes are launched with torchrun rather than mpirun.
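
For completeness, the job is bootstrapped the usual torchrun way, i.e. purely through `torch.distributed` with the `env://` rendezvous, so no MPI communicator is ever created in these processes (sketch below; names are illustrative):

```python
# Launched as, e.g.: torchrun --nnodes=2 --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist

def setup_distributed():
    # torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT;
    # there is no mpirun in the launch path, hence no MPI environment.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return dist.get_rank(), dist.get_world_size(), local_rank
```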

Given the growing trend of containerized deployments and the adoption of Kubernetes for distributed jobs, I was wondering if there are any plans to abstract away or remove the MPI dependency for this feature.

Thanks!

@denera is currently working on lifting the MPI requirement for that overlap.