DGLGraphParallel

A PyTorch data-parallelism solution for training DGL (Deep Graph Library) models on a single machine with multiple GPUs.

Getting Started

Since DGL does not provide simple APIs for data-parallel training, this module offers an unofficial way to train GCN-like models on multiple GPUs with the PyTorch backend.

Prerequisites

  • PyTorch 1.2.x (CUDA enabled)

  • DGL 0.3.x

APIs

class DGLNodeFlowLoader

This class generates one NodeFlow per CUDA device. The generated NodeFlows are gathered into a list and used as the input data.

Currently only the NeighborSampler sampling method is supported.
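
A hypothetical usage sketch is shown below. The constructor arguments are assumptions based on this description and on DGL's NeighborSampler (g and labels are placeholders); the actual signature may differ.

# Hypothetical usage of DGLNodeFlowLoader; argument names are assumed,
# not taken from the actual implementation.
from DGLGraphParallel import DGLNodeFlowLoader   # import path assumed

loader = DGLNodeFlowLoader(
    g,                     # the parent DGLGraph (placeholder)
    labels,                # node label tensor (placeholder)
    batch_size=30000,      # batch size per model replica
    num_neighbors=10,      # NeighborSampler fan-out
    num_hops=2,            # sampling hops (= number of GCN layers)
)

for nf_list, batch_labels in loader:
    # nf_list: one NodeFlow per visible CUDA device (torch.cuda.device_count())
    # batch_labels: the labels of all NodeFlows in nf_list, concatenated
    ...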

class DGLGraphDataParallel(torch.nn.Module)

Similar to torch.nn.DataParallel, this class automatically replicates the model across GPUs and scatters the input data (generated by DGLNodeFlowLoader) to the corresponding GPUs.
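
A minimal training sketch is shown below; the model and loader are placeholders (any NodeFlow-based GCN and a DGLNodeFlowLoader), and only the wrapping pattern, which mirrors torch.nn.DataParallel, is the point.

# Hypothetical training sketch using DGLGraphDataParallel (illustration only).
import torch
import torch.nn.functional as F
from DGLGraphParallel import DGLGraphDataParallel   # import path assumed

def train_epoch(model, loader, lr=3e-2):
    # Wrap the NodeFlow-based GCN so it is replicated across all visible GPUs,
    # mirroring the torch.nn.DataParallel usage pattern.
    model = DGLGraphDataParallel(model.cuda())
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for nf_list, labels in loader:        # loader is a DGLNodeFlowLoader
        logits = model(nf_list)           # the list of NodeFlows is scattered, one per GPU
        loss = F.cross_entropy(logits, labels.cuda())
        optimizer.zero_grad()
        loss.backward()                   # gradients and weight updates stay on one GPU
        optimizer.step()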

Run

See the example gcn_ns_dp.py in the examples folder.

To run the example with the nn.DataParallel API:

$ cp DGLGraphParallel/examples/gcn_ns_dp.py ./
$ DGLBACKEND=pytorch python gcn_ns_dp.py --gpu 0,1,2 --dataset reddit-self-loop --num-neighbors 10 --batch-size 30000 --test-batch-size 30000

To run the example with the nn.DistributedDataParallel API:

(in shell 1)
$ python examples/distributed/run_graph_server.py --dataset reddit-self-loop --num-workers 3
(in shell 2, after the graph server in shell 1 has started)
$ DGLBACKEND=pytorch python examples/distributed/gcn_ns_ddp.py --gpu 0,1 --dataset reddit-self-loop --num-neighbors 10 --batch-size 30000 --test-batch-size 10000

Implementation Details

class DGLNodeFlowLoader

It generates the same number of NodeFlows as torch.cuda.device_count() and gathers them into a list as the inputs. The labels are returned as the concatenation of the labels corresponding to each NodeFlow in the inputs.
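
Roughly, that batching logic can be pictured as in the sketch below, which uses DGL 0.3's dgl.contrib.sampling.NeighborSampler and NodeFlow.layer_parent_nid; g, labels, train_nid, and the size arguments are placeholders, and this is not the repository's actual code.

# Rough sketch of the batching logic described above (illustration only).
import torch
from dgl.contrib.sampling import NeighborSampler

num_devices = torch.cuda.device_count()
sampler = NeighborSampler(g, batch_size, expand_factor=num_neighbors,
                          num_hops=num_hops, shuffle=True, seed_nodes=train_nid)

def batches():
    nf_list, label_list = [], []
    for nf in sampler:
        nf_list.append(nf)
        # labels of the seed nodes in the last NodeFlow layer
        label_list.append(labels[nf.layer_parent_nid(-1)])
        if len(nf_list) == num_devices:
            # one NodeFlow per GPU; labels concatenated in the same order
            yield nf_list, torch.cat(label_list)
            nf_list, label_list = [], []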

class DGLGraphDataParallel

This class is modified from torch.nn.DataParallel. The model's input should be a list of NodeFlows.

Each forward pass performs the following operations (similar to torch.nn.DataParallel):

  • Scatter inputs and kwargs to all GPUs (by using NodeFlow.copy_from_parent())

  • Replicate the module to all GPUs (same as torch)

  • Parallel apply for all GPUs (same as torch)

  • Gather forwarding results back to one GPU (same as torch)

Therefore, DGLGraphDataParallel transmits data (NodeFlows), weights, and forward results on every single forward pass. The backward pass (gradient computation and weight updates) is applied on one GPU only, as sketched below.
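
Here is a minimal sketch of that forward logic, using the replicate / parallel_apply / gather helpers from torch.nn.parallel; the real class additionally handles kwargs and device placement, and the ctx argument to copy_from_parent is an assumption.

# Simplified sketch of the DGLGraphDataParallel forward pass (illustration only).
import torch
from torch.nn.parallel import replicate, parallel_apply, gather

class DGLGraphDataParallelSketch(torch.nn.Module):
    def __init__(self, module, device_ids=None, output_device=None):
        super().__init__()
        self.module = module
        self.device_ids = device_ids or list(range(torch.cuda.device_count()))
        self.output_device = output_device if output_device is not None else self.device_ids[0]

    def forward(self, nodeflows):
        devices = self.device_ids[:len(nodeflows)]
        # 1. Scatter: copy each NodeFlow's features from the parent graph onto its GPU
        #    (the ctx keyword of copy_from_parent is assumed here).
        for nf, dev in zip(nodeflows, devices):
            nf.copy_from_parent(ctx=torch.device('cuda', dev))
        # 2. Replicate the module to all participating GPUs.
        replicas = replicate(self.module, devices)
        # 3. Run the replicas in parallel, one NodeFlow each.
        outputs = parallel_apply(replicas, [(nf,) for nf in nodeflows], devices=devices)
        # 4. Gather the forward results back onto one GPU.
        return gather(outputs, self.output_device)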

Alternatively, we can leverage the PyTorch NCCL backend (which only transfers gradients, in a ring all-reduce pattern) and the DGL Graph Store (which keeps the graph data in shared memory) to implement single-machine multi-GPU training. A demo can be found in examples/distributed/.
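
Below is a minimal per-process sketch of that NCCL + DistributedDataParallel setup (one process per GPU). The graph-store connection and the sampler are omitted; model, loader, and labels_of are placeholders, not the demo's actual code.

# Per-process sketch of NCCL-backed DistributedDataParallel training (illustration only).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def run(rank, world_size, model, loader, labels_of):
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456',
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = model.cuda(rank)
    # Only gradients are exchanged, via ring all-reduce; no weights or inputs
    # move between GPUs during the forward pass.
    model = DistributedDataParallel(model, device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-2)

    for nf in loader:                       # each process samples its own NodeFlows
        logits = model(nf)
        loss = torch.nn.functional.cross_entropy(logits, labels_of(nf).cuda(rank))
        optimizer.zero_grad()
        loss.backward()                     # NCCL all-reduces the gradients here
        optimizer.step()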

Measurements

All of the following results are measured on up to three GTX 1080 GPUs using the nn.DataParallel API and examples/gcn_ns_dp.py with 2 GCN layers and 10 sampled neighbors. The batch size is 30000.

  • Weak Scalability Test (each model replica's batch size = 30000)

    +------+---------------+  
    | GPUs | Epoch Time(s) |  
    +------+---------------+  
    |  1   | 3.6018        |  
    +------+---------------+  
    |  2   | 4.8981        |  
    +------+---------------+  
    |  3   | 4.1862        |  
    +------+---------------+ 
    
  • Strong Scalability Test (each model replica's batch size = 30000 / GPUs)

    +------+---------------+
    | GPUs | Epoch Time(s) |
    +------+---------------+
    |  1   | 3.6018        |
    +------+---------------+
    |  2   | 3.9829        |
    +------+---------------+
    |  3   | 4.5653        |
    +------+---------------+
    
