Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.

Home Page: https://lightning.ai

Multi-node Training with DDP stuck at "Initialize distributed..." on SLURM cluster

OswaldHe opened this issue · comments

Bug description

I'm working on a SLURM cluster with 8 AMD MI100 GPUs spread across 2 nodes (4 GPUs per node). I followed the instructions (https://lightning.ai/docs/pytorch/stable/clouds/cluster_advanced.html) to submit a multi-node training job, but the job gets stuck at "Initializing distributed: ...". I checked all related issues and none of them solved the problem.

What version are you seeing the problem on?

v2.2

How to reproduce the bug

Training Script:

import os
from torch import optim, nn, utils, Tensor
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import lightning as L

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))


# define the LightningModule
class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        # it is independent of forward
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # Logging to TensorBoard (if installed) by default
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


# init the autoencoder
autoencoder = LitAutoEncoder(encoder, decoder)

# setup data
dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset)

# train the model (hint: here are some helpful Trainer arguments for rapid idea iteration)
trainer = L.Trainer(limit_train_batches=100, max_epochs=1, num_nodes=2, devices=4, strategy="ddp")
trainer.fit(model=autoencoder, train_dataloaders=train_loader)

SLURM batch script:

#!/bin/bash

#SBATCH -p mi1004x
#SBATCH --nodes=2             # This needs to match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=4   # This needs to match Trainer(devices=...)
#SBATCH --time=0-00:30:00
#SBATCH -e slurm-%j.err

source ~/miniconda3/bin/activate pylight
# run script from above
srun python train.py
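
For comparison, below is a hedged variant of the batch script that also requests the GPUs explicitly, similar to the example in the Lightning cluster docs. Whether the right directive is --gres=gpu:4 or --gpus-per-node=4 depends on how the cluster is configured, so treat that line as an assumption:

#!/bin/bash

#SBATCH -p mi1004x
#SBATCH --nodes=2             # must match Trainer(num_nodes=...)
#SBATCH --ntasks-per-node=4   # must match Trainer(devices=...)
#SBATCH --gres=gpu:4          # request 4 GPUs per node (cluster-specific; may need --gpus-per-node=4 instead)
#SBATCH --time=0-00:30:00
#SBATCH -e slurm-%j.err

source ~/miniconda3/bin/activate pylight
srun python train.py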

Error messages and logs

Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('AMD Instinct MI100') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
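
One way to narrow down where the rendezvous hangs is to enable the collective-library and torch.distributed debug logs in the batch script before srun; on ROCm, RCCL honors the NCCL_* environment variables. This is only a debugging sketch, and the network interface name is a placeholder that depends on the cluster:

export NCCL_DEBUG=INFO                  # RCCL reads the NCCL_* variables on ROCm
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # verbose logging from torch.distributed
# export NCCL_SOCKET_IFNAME=<interface> # pin the NIC used for inter-node traffic if several are present
srun python train.py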

Environment

Current environment
  • CUDA:
    • GPU:
      • AMD Instinct MI100
      • AMD Instinct MI100
      • AMD Instinct MI100
      • AMD Instinct MI100
    • available: True
    • version: None
  • Lightning:
    • lightning: 2.2.1
    • lightning-utilities: 0.11.2
    • pytorch-lightning: 2.2.1
    • pytorch-triton-rocm: 2.2.0
    • torch: 2.2.0+rocm5.6
    • torchaudio: 2.2.0+rocm5.6
    • torchmetrics: 1.3.2
    • torchvision: 0.17.0+rocm5.6
  • Packages:
    • absl-py: 2.1.0
    • aiohttp: 3.9.3
    • aiosignal: 1.3.1
    • annotated-types: 0.6.0
    • async-timeout: 4.0.3
    • attrs: 23.2.0
    • certifi: 2022.12.7
    • charset-normalizer: 2.1.1
    • deepspeed: 0.14.0
    • filelock: 3.9.0
    • frozenlist: 1.4.1
    • fsspec: 2023.4.0
    • future: 1.0.0
    • grpcio: 1.62.1
    • hjson: 3.1.0
    • idna: 3.4
    • imageio: 2.34.0
    • jinja2: 3.1.2
    • lightning: 2.2.1
    • lightning-utilities: 0.11.2
    • markdown: 3.6
    • markupsafe: 2.1.3
    • mpmath: 1.3.0
    • multidict: 6.0.5
    • networkx: 3.2.1
    • ninja: 1.11.1.1
    • numpy: 1.26.3
    • packaging: 24.0
    • pandas: 2.2.1
    • pillow: 10.2.0
    • pip: 23.3.1
    • protobuf: 5.26.1
    • psutil: 5.9.8
    • py-cpuinfo: 9.0.0
    • pydantic: 2.7.0
    • pydantic-core: 2.18.1
    • pynvml: 11.5.0
    • python-dateutil: 2.9.0.post0
    • pytorch-lightning: 2.2.1
    • pytorch-triton-rocm: 2.2.0
    • pytz: 2024.1
    • pyyaml: 6.0.1
    • requests: 2.28.1
    • setuptools: 68.2.2
    • six: 1.16.0
    • sympy: 1.12
    • tensorboard: 2.16.2
    • tensorboard-data-server: 0.7.2
    • test-tube: 0.7.5
    • torch: 2.2.0+rocm5.6
    • torchaudio: 2.2.0+rocm5.6
    • torchmetrics: 1.3.2
    • torchvision: 0.17.0+rocm5.6
    • tqdm: 4.66.2
    • typing-extensions: 4.8.0
    • tzdata: 2024.1
    • urllib3: 1.26.13
    • werkzeug: 3.0.1
    • wheel: 0.41.2
    • yarl: 1.9.4
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.10.14
    • release: 5.14.0-162.18.1.el9_1.x86_64
    • version: SMP PREEMPT_DYNAMIC Wed Mar 1 22:02:24 UTC 2023

More info

No response

Try using "srun python3 train.py", i.e. change python to python3.

I tried python3, but the issue still remains.

I have the same issue. It works fine when launched directly with srun, but it hangs when submitted as a job with sbatch.
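
To separate Lightning from the cluster setup, a bare torch.distributed init can be run under the same sbatch allocation. A minimal sketch, assuming the standard SLURM variables are exported to each task and that port 12355 (arbitrary) is reachable between the nodes. In the batch script, before srun:

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=12355
srun python ddp_smoke_test.py

ddp_smoke_test.py (hypothetical file name):

import os
import datetime
import torch.distributed as dist

# SLURM exports these per task; Lightning derives its rendezvous info from the same variables
rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
print(f"rank {rank}/{world_size} connecting to "
      f"{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}", flush=True)

# gloo avoids the GPU/RCCL stack entirely; if this also hangs, the problem is
# networking/rendezvous between the nodes rather than ROCm or Lightning
dist.init_process_group(
    backend="gloo",
    init_method="env://",
    rank=rank,
    world_size=world_size,
    timeout=datetime.timedelta(seconds=60),
)
dist.barrier()
print(f"rank {rank}: init + barrier OK", flush=True)
dist.destroy_process_group()

If this passes with gloo but hangs with backend="nccl", that points at the RCCL/network configuration rather than the job setup.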