NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.

Home Page: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html

[Question] ub_tp_comm_overlap config setup

tylaar opened this issue · comments

Hi there,
I found there is very little documentation about the ub_tp_comm_overlap config during model setup, both in TransformerLayer and AttentionLayer. It seems that setting ub_tp_comm_overlap=True results in a "ub_manager not initialized" error, but there is not much in the repo about how to call the initialize_ub function in practice. Is there a link or a pointer explaining how to set up this ub_manager for TP comm overlap?

Thanks a lot

Adding @denera for visibility.

I agree that currently there is pretty much zero documentation around UB usage. We are working on making that feature easier to use and document it better.

Thanks @ptrendx! Is there any quick guide, introduction, or sample code I could use to experiment with it?

Okay, so over the last few hours I played around a little with ub_tp_comm_overlap=True, and the things I've worked through are listed below:

  1. GDRCopy seems to be mandatory now (not sure why my previous UB code had it commented out behind a macro), so I installed GDRCopy following its instructions.
  2. I then found that my setup.py has logic that skips appending -DMPI and -DGDR when "userbuffer_use_c10d_pg" is used. Building that way made loading tex fail, since the UB path appears to require the MPI interface and throws errors like "undefined symbol: ompi_mpi_char". I solved this by not using c10d_pg and instead building directly with MPI + GDR.
  3. This is the step that puzzles me now. After the two steps above, as soon as I call initialize_ub it throws something like:

     *** The MPI_Comm_rank() function was called before MPI_INIT was invoked.
     *** This is disallowed by the MPI standard.
     *** Your MPI job will now abort.

     Now I have no idea where MPI_Init is supposed to be called while TE is booting up.

Hi @tylaar -- the NVLink collectives provided by the userbuffers code in transformer_engine/pytorch/csrc/userbuffers are bootstrapped with MPI on initialization. PR #760 removes this dependency, but it has not been upstreamed yet. In your case, I'm guessing that the easiest way to initialize MPI before initializing userbuffers is to call torch.distributed.init_process_group(backend="mpi") at the top of your code.
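
For reference, here is a minimal sketch of that ordering. It assumes PyTorch was built with MPI support, and the initialize_ub arguments below are illustrative placeholders rather than the exact signature:

import torch.distributed as dist
import transformer_engine.pytorch as te

# Initializing the MPI backend triggers MPI_Init() before the userbuffers
# bootstrap inside initialize_ub() needs it.
dist.init_process_group(backend="mpi")

tp_size = dist.get_world_size()                    # example: use all ranks as the TP group
seq_len, batch_size, hidden_size = 2048, 2, 4096   # placeholder model dimensions

te.initialize_ub(
    shape=[seq_len * batch_size, hidden_size],     # placeholder communication-buffer shape
    tp_size=tp_size,
)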

Hi @denera, thanks for the quick response. I don't think my situation can be mitigated that easily by switching the backend to mpi. My existing code stack mainly uses nccl as the backend, and when I switch to mpi it reports that the current torch build is not compiled with MPI and would need to be rebuilt ...

Also, I'm not sure whether changing this could affect the communication performance that was previously on nccl. Is that safe?
Is there any other workaround for this situation?

Many thanks!

Hi @tylaar -- to clarify my earlier suggestion, I did not mean to say you need to switch NCCL backend to MPI for torch.distributed, but rather that you need to initialize both backends. You can do this by first initializing NCCL with torch.distributed.init_process_group(backend="nccl") and then creating an MPI process group with mpi_pg = torch.distributed.new_group(backend="mpi") (or vice versa), assuming PyTorch is compiled with MPI support. This is strictly for triggering an MPI_Init() call somewhere in your execution before you call initialize_ub(...). Ultimately, every collective you have outside of the comm+GEMM overlap that was on NCCL before should still be called on the NCCL backend.
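
As a rough sketch (again assuming a PyTorch build with MPI support), that ordering looks like:

import torch.distributed as dist

# Default (world) process group stays on NCCL -- all of your existing collectives keep using it.
dist.init_process_group(backend="nccl")

# Secondary MPI group whose only purpose is to trigger MPI_Init() before
# initialize_ub() bootstraps the userbuffers collectives.
mpi_pg = dist.new_group(backend="mpi")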

As for the comm+GEMM overlap, the MPI is used strictly just for bootstrapping. The NVLink collectives in the comm+GEMM overlap algorithms execute through CUDA multicast, so you should not see a performance impact there beyond the one-time overhead of creating/splitting MPI communicators when you call initialize_ub(...).

Since your existing PyTorch is not compiled with MPI, it might be easier to use mpi4py instead to force the MPI_Init() call. You would need to initialize your NCCL process group with

from mpi4py import MPI  # must be imported before torch.distributed so MPI_Init() runs first
import torch.distributed as dist
# mpi4py supplies the rank/world size, since mpiexec does not set RANK/WORLD_SIZE for torch
dist.init_process_group(backend="nccl",
                        rank=MPI.COMM_WORLD.Get_rank(),
                        world_size=MPI.COMM_WORLD.Get_size())

and then run your script with

$ mpiexec -np <N> -x MASTER_ADDR=<1.2.3.4> -x MASTER_PORT=<1234> -x PATH python <file.py> <args>

instead of torchrun.

Let me know if that works for you.

Hi @denera , thanks for the quick reply!

I tried recompiling PyTorch, but rebuilding all of its dependency packages would take too long, so I gave up on that route. I then turned to the mpi4py solution you proposed, and here is the new issue I've hit:

What if I am still using torchrun after the mpiexec, with arguments like:
--node_rank=0 --nproc_per_node=<N> --nnodes=1 --rdzv_endpoint=10.xx.xx.xx:xxxx
In that case torchrun reports an error: "The server socket has failed to bind to [::]:xxxx (errno: 98 - Address already in use)".

Sorry for asking so many questions, and thanks again for your help!

Hi @tylaar -- there is no way to use mpiexec and torchrun at the same time, and it isn't necessary anyway, because torch.distributed supports launching with mpiexec.

For example, if your torchrun launch is like

$ torchrun --node_rank=0 --nproc_per_node=<N> --nnodes=1 --rdzv_endpoint=<1.2.3.4>:<1234> <file.py> <args>

the equivalent mpiexec launch would be

$ mpiexec -np <N> -x MASTER_ADDR=<1.2.3.4> -x MASTER_PORT=<1234> -x PATH python <file.py> <args>

The default behavior for both of these is to launch on a single node so --node_rank=0 and --nnodes=1 are not necessary.

This is the same mpiexec command I shared with you in my previous comment, and I've verified that it works on our compute cluster with rdzv_endpoint address set as MASTER_ADDR and rdzv_endpoint port set as MASTER_PORT.

I was also able to successfully launch multi-node torch.distributed jobs with mpiexec instead of torchrun. All it took was replacing -np <N> with -hostfile myhosts.txt, where myhosts.txt contains a list of network addresses for each node and the number of processes I want to launch on each one (see the example below). I would recommend consulting the MPI documentation for more details. The host file offers a lot of flexibility and customization, such as controlling which ranges of global ranks map onto which nodes, whether the ranks should be bound to physical cores or oversubscribed, etc.
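
As an illustration (hostnames and slot counts here are placeholders, using Open MPI host file syntax), a two-node host file could look like:

# myhosts.txt -- one line per node, with the number of ranks ("slots") to launch on it
node01 slots=8
node02 slots=8

and the corresponding multi-node launch would be

$ mpiexec -hostfile myhosts.txt -x MASTER_ADDR=<1.2.3.4> -x MASTER_PORT=<1234> -x PATH python <file.py> <args>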

The only core requirement for torch.distributed is to specify -x MASTER_ADDR=<1.2.3.4> -x MASTER_PORT=<1234> as the address and port of your primary node (usually root node 0), which has passwordless access to every other node. torch.distributed.init_process_group(...) depends on these environment variables to correctly rendezvous across all the processes that call into the initialization, regardless of how those processes were originally launched.
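
As a quick sanity check (a sketch building on the mpi4py snippet above), a trivial all-reduce run under mpiexec should report the world size on every rank once the MASTER_ADDR/MASTER_PORT rendezvous succeeds:

from mpi4py import MPI   # imported first so MPI_Init() runs before torch.distributed
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl",
                        rank=MPI.COMM_WORLD.Get_rank(),
                        world_size=MPI.COMM_WORLD.Get_size())

torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
t = torch.ones(1, device="cuda")
dist.all_reduce(t)       # sums across ranks -> equals the world size on every rank
print(f"rank {dist.get_rank()}: all_reduce -> {t.item()}")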

Hey @denera, sorry for the late reply. I've tried several ways to bypass the MPI requirement over the past few days but haven't found a good one, due to incompatibilities at my underlying infrastructure level (yeah, like the PyTorch issues I've encountered before ...). I guess I will just wait for your PR to get merged and stick to the latest version of TE. I will close this for now. Thanks again for your effort!