
Distributed PyTorch Lightning Training on Ray

This library adds new PyTorch Lightning accelerators for distributed training using the Ray distributed computing framework.

These accelerators let you quickly and easily parallelize training while still leveraging all the benefits of PyTorch Lightning, using your preferred training protocol: either PyTorch Distributed Data Parallel or Horovod.

Once you add your accelerator to the PyTorch Lightning Trainer, you can parallelize training across all the cores of your laptop, or across a massive multi-node, multi-GPU cluster, with no additional code changes.

This library also comes with an integration with Ray Tune for distributed hyperparameter tuning experiments.

Installation

You can install the master branch of ray_lightning_accelerators like so:

pip install git+https://github.com/ray-project/ray_lightning_accelerators#ray_lightning
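
The library builds on top of Ray and PyTorch Lightning. If those packages are not already installed (an assumption about your environment, not part of the original instructions), both are available from PyPI:

pip install ray pytorch-lightning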

PyTorch Distributed Data Parallel Accelerator on Ray

The RayAccelerator provides Distributed Data Parallel training on a Ray cluster. PyTorch DDP is used as the distributed training protocol, and Ray is used to launch and manage the training worker processes.

Here is a simplified example:

import pytorch_lightning as pl
from ray_lightning import RayAccelerator

# Create your PyTorch Lightning model here.
ptl_model = MNISTClassifier(...)
accelerator = RayAccelerator(num_workers=4, cpus_per_worker=1, use_gpu=True)

# If using GPUs, set the ``gpus`` arg to a value > 0.
# The actual number of GPUs is determined by ``num_workers``.
trainer = pl.Trainer(..., gpus=1, accelerator=accelerator)
trainer.fit(ptl_model)

Because Ray launches the worker processes itself, rather than relying on the same script being invoked multiple times, you can use this accelerator even in cases where the standard DDPAccelerator cannot be used (a minimal sketch follows the list below), such as:

  • Jupyter Notebooks, Google Colab, Kaggle
  • Calling fit or test multiple times in the same script
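
Below is a minimal, self-contained sketch of the second case, calling fit more than once in a single script. It is not taken from the original README: the LitRegressor model and random dataset are hypothetical stand-ins for your own code, and the RayAccelerator arguments simply mirror the example above.

import ray
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from ray_lightning import RayAccelerator

# Hypothetical stand-in for your own LightningModule.
class LitRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

# Random tensors standing in for a real dataset.
x, y = torch.randn(256, 8), torch.randn(256, 1)
train_loader = DataLoader(TensorDataset(x, y), batch_size=32)

# Start Ray locally (use ray.init(address="auto") to join an existing cluster).
ray.init()

# Calling fit multiple times in the same script works because Ray, not the
# script itself, launches the worker processes.
for _ in range(2):
    accelerator = RayAccelerator(num_workers=2, cpus_per_worker=1, use_gpu=False)
    trainer = pl.Trainer(max_epochs=1, accelerator=accelerator)
    trainer.fit(LitRegressor(), train_loader)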

Horovod Accelerator on Ray

If you prefer to use Horovod as the distributed training protocol, use the HorovodRayAccelerator instead.

import pytorch_lightning as pl
from ray_lightning import HorovodRayAccelerator

# Create your PyTorch Lightning model here.
ptl_model = MNISTClassifier(...)

# 2 nodes, 4 workers per node, each using 1 CPU and 1 GPU.
accelerator = HorovodRayAccelerator(num_hosts=2, num_slots=4, use_gpu=True)

# If using GPUs, set the ``gpus`` arg to a value > 0.
# The actual number of GPUs is determined by ``num_slots``.
trainer = pl.Trainer(..., gpus=1, accelerator=accelerator)
trainer.fit(ptl_model)
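
With num_hosts=2 and num_slots=4, the accelerator launches 2 × 4 = 8 workers in total, so with use_gpu=True the cluster needs 8 GPUs. For quick local experimentation, here is a hedged sketch using only the constructor arguments shown above; the single-host values are illustrative:

# One host, two worker slots, CPU-only -- illustrative values for a laptop run.
accelerator = HorovodRayAccelerator(num_hosts=1, num_slots=2, use_gpu=False)
trainer = pl.Trainer(..., accelerator=accelerator)
trainer.fit(ptl_model)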

Multi-node Distributed Training

Using the same examples above, you can run distributed training on a multi-node cluster in just two steps.

  1. Use Ray's cluster launcher to start a Ray cluster: ray up my_cluster_config.yaml.
  2. Execute your Python script on the Ray cluster: ray submit my_cluster_config.yaml train.py. This will rsync your training script to the head node and execute it on the Ray cluster.

You no longer have to set environment variables or configuration on every node, or manually run your training script on each of them.
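
Depending on how the accelerator initializes Ray, your train.py may need to connect to the running cluster explicitly before building the trainer. This is an assumption, not a step from the original instructions; ray.init(address="auto") is the standard Ray call for attaching to an existing cluster:

import ray

# Attach to the cluster started by `ray up` instead of starting a new local Ray instance.
ray.init(address="auto")

# ... the rest of train.py is unchanged from the examples above.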

Hyperparameter Tuning with Ray Tune

ray_lightning also integrates with Ray Tune to provide distributed hyperparameter tuning for your distributed model training. You can run multiple PyTorch Lightning training runs in parallel, each with a different hyperparameter configuration and each parallelized by itself. All you have to do is move your training code into a function, pass that function to tune.run, and add the appropriate callback (either TuneReportCallback or TuneReportCheckpointCallback) to your PyTorch Lightning Trainer.

Example using ray_lightning with Tune:

import pytorch_lightning as pl
from ray import tune
# Depending on your Ray version, TuneReportCallback may live in a different module.
from ray.tune.integration.pytorch_lightning import TuneReportCallback
from ray_lightning import RayAccelerator


def train_mnist(config):
    # Create your PTL model.
    model = MNISTClassifier(config)

    # Create the Tune reporting callback.
    metrics = {"loss": "ptl/val_loss", "acc": "ptl/val_accuracy"}
    callbacks = [TuneReportCallback(metrics, on="validation_end")]

    trainer = pl.Trainer(
        max_epochs=4,
        callbacks=callbacks,
        accelerator=RayAccelerator(num_workers=4, use_gpu=False))
    trainer.fit(model)


config = {
    "layer_1": tune.choice([32, 64, 128]),
    "layer_2": tune.choice([64, 128, 256]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([32, 64, 128]),
}

# Number of hyperparameter configurations (trials) to sample.
num_samples = 10

# Make sure to specify how many actors each training run will create via the "extra_cpu" field.
analysis = tune.run(
    train_mnist,
    metric="loss",
    mode="min",
    config=config,
    num_samples=num_samples,
    resources_per_trial={
        "cpu": 1,
        "extra_cpu": 4
    },
    name="tune_mnist")

print("Best hyperparameters found were: ", analysis.best_config)

License: Apache License 2.0

