NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

Home Page: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html

MPI Dependency for Computation-Communication Overlapping in Tensor Parallelism

zhipeng93 opened this issue

Hi,

I've noticed that you have implemented a feature that allows overlapping computation and communication in tensor-parallel operations. This is a significant enhancement that can improve the efficiency of distributed training workflows.
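
For reference, this is roughly how I am trying to enable it on my side (a minimal sketch; the exact entry points and argument names, e.g. `initialize_ub` and `ub_tp_comm_overlap`, are my best reading of the docs and may differ between TE versions):

```python
# Sketch of enabling the tensor-parallel comm/GEMM overlap in TransformerEngine.
# NOTE: names like initialize_ub, ub_tp_comm_overlap, seq_length, micro_batch_size
# are assumptions from my reading of the docs and may vary across TE versions.
import torch
import transformer_engine.pytorch as te

SEQ_LEN, MICRO_BATCH, HIDDEN, HEADS = 2048, 2, 12288, 96
TP_SIZE = 8

# Allocate the userbuffers backing the overlap. As far as I can tell, this is
# the step that bootstraps its internal communicators through MPI.
te.module.base.initialize_ub(
    shape=[SEQ_LEN * MICRO_BATCH, HIDDEN],
    tp_size=TP_SIZE,
    use_fp8=False,
    dtype=torch.bfloat16,
)

layer = te.TransformerLayer(
    hidden_size=HIDDEN,
    ffn_hidden_size=4 * HIDDEN,
    num_attention_heads=HEADS,
    seq_length=SEQ_LEN,
    micro_batch_size=MICRO_BATCH,
    ub_tp_comm_overlap=True,  # request the tensor-parallel comm/GEMM overlap
)
```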

However, when deploying jobs with torchrun on a Kubernetes (k8s) cluster, the overlapping feature does not work as expected. It appears that the current implementation depends on MPI for certain initialization steps, and an MPI environment is not available when processes are launched with torchrun rather than mpirun.
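
For completeness, the job is bootstrapped the usual torchrun way, i.e. purely through `torch.distributed` with the `env://` rendezvous, so no MPI communicator is ever created in these processes (sketch below; names are illustrative):

```python
# Launched as, e.g.: torchrun --nnodes=2 --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist

def setup_distributed():
    # torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT;
    # there is no mpirun in the launch path, hence no MPI environment.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return dist.get_rank(), dist.get_world_size(), local_rank
```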

Given the growing trend of containerized deployments and the adoption of Kubernetes for distributed jobs, I was wondering if there are any plans to abstract away or remove the MPI dependency for this feature.

Thanks!

@denera is currently working on lifting the MPI requirement for that overlap.