Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.

Home Page: https://lightning.ai

Multi-node Training with DDP stuck at "Initialize distributed..." on SLURM cluster

OswaldHe opened this issue · comments

Bug description

I'm working on a SLURM cluster with 8 AMD MI100 GPUs spread across 2 nodes (4 GPUs per node). I followed the instructions (https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html) to submit a multi-node training job, but the job gets stuck at "Initializing distributed: ...". I checked all related issues and none of them solved the problem.

What version are you seeing the problem on?

v2.2

How to reproduce the bug

Training Script:

import os
from torch import optim, nn, utils, Tensor
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import lightning as L

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))


# define the LightningModule
class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # it is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard (if installed) by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


# init the autoencoder
autoencoder = LitAutoEncoder(encoder, decoder)

# setup data
dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset)

# train the model (hint: here are some helpful Trainer arguments for rapid idea iteration)
trainer = L.Trainer(limit_train_batches=100, max_epochs=1, num_nodes=2, devices=4, strategy="ddp")
trainer.fit(model=autoencoder, train_dataloaders=train_loader)

SLURM batch script:

#!/bin/bash

#SBATCH -p mi1004x
#SBATCH --nodes=2             # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=4   # This needs to match Trainer(devices=...)
#SBATCH --time=0-00:30:00
#SBATCH -e slurm-%j.err

source ~/miniconda3/bin/activate pylight
# run script from above
srun python train.py
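
For comparison, below is a hedged variant of the batch script that also requests the GPUs explicitly, similar to the example in the Lightning cluster docs. Whether the right directive is --gres=gpu:4 or --gpus-per-node=4 depends on how the cluster is configured, so treat that line as an assumption:

#!/bin/bash

#SBATCH -p mi1004x
#SBATCH --nodes=2             # must match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=4   # must match Trainer(devices=...)
#SBATCH --gres=gpu:4          # request 4 GPUs per node (cluster-specific; may need --gpus-per-node=4 instead)
#SBATCH --time=0-00:30:00
#SBATCH -e slurm-%j.err

source ~/miniconda3/bin/activate pylight
srun python train.py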

Error messages and logs

Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('AMD Instinct MI100') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
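
One way to narrow down where the rendezvous hangs is to enable the collective-library and torch.distributed debug logs in the batch script before srun; on ROCm, RCCL honors the NCCL_* environment variables. This is only a debugging sketch, and the network interface name is a placeholder that depends on the cluster:

export NCCL_DEBUG=INFO                  # RCCL reads the NCCL_* variables on ROCm
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # verbose logging from torch.distributed
# export NCCL_SOCKET_IFNAME=<interface> # pin the NIC used for inter-node traffic if several are present
srun python train.py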

Environment

Current environment
  • CUDA:
    • GPU:
      • AMD Instinct MI100
      • AMD Instinct MI100
      • AMD Instinct MI100
      • AMD Instinct MI100
    • available: True
    • version: None
  • Lightning:
    • lightning: 2.2.1
    • lightning-utilities: 0.11.2
    • pytorch-lightning: 2.2.1
    • pytorch-triton-rocm: 2.2.0
    • torch: 2.2.0+rocm5.6
    • torchaudio: 2.2.0+rocm5.6
    • torchmetrics: 1.3.2
    • torchvision: 0.17.0+rocm5.6
  • Packages:
    • absl-py: 2.1.0
    • aiohttp: 3.9.3
    • aiosignal: 1.3.1
    • annotated-types: 0.6.0
    • async-timeout: 4.0.3
    • attrs: 23.2.0
    • certifi: 2022.12.7
    • charset-normalizer: 2.1.1
    • deepspeed: 0.14.0
    • filelock: 3.9.0
    • frozenlist: 1.4.1
    • fsspec: 2023.4.0
    • future: 1.0.0
    • grpcio: 1.62.1
    • hjson: 3.1.0
    • idna: 3.4
    • imageio: 2.34.0
    • jinja2: 3.1.2
    • lightning: 2.2.1
    • lightning-utilities: 0.11.2
    • markdown: 3.6
    • markupsafe: 2.1.3
    • mpmath: 1.3.0
    • multidict: 6.0.5
    • networkx: 3.2.1
    • ninja: 1.11.1.1
    • numpy: 1.26.3
    • packaging: 24.0
    • pandas: 2.2.1
    • pillow: 10.2.0
    • pip: 23.3.1
    • protobuf: 5.26.1
    • psutil: 5.9.8
    • py-cpuinfo: 9.0.0
    • pydantic: 2.7.0
    • pydantic-core: 2.18.1
    • pynvml: 11.5.0
    • python-dateutil: 2.9.0.post0
    • pytorch-lightning: 2.2.1
    • pytorch-triton-rocm: 2.2.0
    • pytz: 2024.1
    • pyyaml: 6.0.1
    • requests: 2.28.1
    • setuptools: 68.2.2
    • six: 1.16.0
    • sympy: 1.12
    • tensorboard: 2.16.2
    • tensorboard-data-server: 0.7.2
    • test-tube: 0.7.5
    • torch: 2.2.0+rocm5.6
    • torchaudio: 2.2.0+rocm5.6
    • torchmetrics: 1.3.2
    • torchvision: 0.17.0+rocm5.6
    • tqdm: 4.66.2
    • typing-extensions: 4.8.0
    • tzdata: 2024.1
    • urllib3: 1.26.13
    • werkzeug: 3.0.1
    • wheel: 0.41.2
    • yarl: 1.9.4
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.10.14
    • release: 5.14.0-162.18.1.el9_1.x86_64
    • version: SMP PREEMPT_DYNAMIC Wed Mar 1 22:02:24 UTC 2023

More info

No response

Try using "srun python3 train.py", i.e. change python to python3.

I tried python3, but the issue still remains.

I have the same issue. It works fine when launched directly with srun, but it hangs when submitted as a job with sbatch.
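
To separate Lightning from the cluster setup, a bare torch.distributed init can be run under the same sbatch allocation. A minimal sketch, assuming the standard SLURM variables are exported to each task and that port 12355 (arbitrary) is reachable between the nodes. In the batch script, before srun:

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=12355
srun python ddp_smoke_test.py

ddp_smoke_test.py (hypothetical file name):

import os
import datetime
import torch.distributed as dist

# SLURM exports these per task; Lightning derives its rendezvous info from the same variables
rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
print(f"rank {rank}/{world_size} connecting to "
      f"{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}", flush=True)

# gloo avoids the GPU/RCCL stack entirely; if this also hangs, the problem is
# networking/rendezvous between the nodes rather than ROCm or Lightning
dist.init_process_group(
    backend="gloo",
    init_method="env://",
    rank=rank,
    world_size=world_size,
    timeout=datetime.timedelta(seconds=60),
)
dist.barrier()
print(f"rank {rank}: init + barrier OK", flush=True)
dist.destroy_process_group()

If this passes with gloo but hangs with backend="nccl", that points at the RCCL/network configuration rather than the job setup.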