A PyTorch data parallelism solution for training DGL (Deep Graph Library) models on a single machine with multiple GPUs.
Since DGL does not provide simple APIs for parallel training, this module offers an unofficial way to train GCN-like models on multiple GPUs with the PyTorch backend.
- PyTorch 1.2.x (CUDA enabled)
- DGL 0.3.x
class DGLNodeFlowLoader
This class generates multiple NodeFlows according to the number of CUDA devices. The generated NodeFlows are gathered into a list as the input data. Currently it only supports the NeighborSampler method.
class DGLGraphDataParallel(torch.nn.Module)
Similar to torch.nn.DataParallel, this class automatically replicates the model across GPUs and scatters the input data (generated by DGLNodeFlowLoader) to the corresponding GPUs.
See the example gcn_ns_dp.py in the examples folder.
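For orientation, a training loop with these two classes might look like the following sketch. It assumes the model/loader API used in gcn_ns_dp.py; the import path, the GCNSampling model, and the DGLNodeFlowLoader constructor arguments are illustrative assumptions, not the verbatim interface.

```python
# Minimal sketch, assuming a GCN-like model whose forward takes a NodeFlow;
# the import path, GCNSampling, and the loader arguments are illustrative only.
import torch
import torch.nn.functional as F

from DGLGraphParallel import DGLNodeFlowLoader, DGLGraphDataParallel  # assumed import path

model = GCNSampling(in_feats, 16, n_classes, n_layers=2)   # model defined in the example script
model = DGLGraphDataParallel(model).cuda()                 # replicate across all visible GPUs

loader = DGLNodeFlowLoader(g, labels, batch_size=30000,    # assumed constructor signature
                           num_neighbors=10, num_hops=2)

optimizer = torch.optim.Adam(model.parameters(), lr=3e-2)

for inputs, batch_labels in loader:            # inputs: a list of NodeFlows, one per GPU
    logits = model(inputs)                     # scatter -> replicate -> parallel_apply -> gather
    loss = F.cross_entropy(logits, batch_labels.to(logits.device))
    optimizer.zero_grad()
    loss.backward()                            # gradients and the update stay on one GPU
    optimizer.step()
```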
To run the example with the nn.DataParallel API:
$ cp DGLGraphParallel/examples/gcn_ns_dp.py ./
$ DGLBACKEND=pytorch python gcn_ns_dp.py --gpu 0,1,2 --dataset reddit-self-loop --num-neighbors 10 --batch-size 30000 --test-batch-size 30000
To run the example with the nn.DistributedDataParallel API:
(in terminal 1)
$ python examples/distributed/run_graph_server.py --dataset reddit-self-loop --num-workers 3
(in terminal 2, after the graph server from terminal 1 has started)
$ DGLBACKEND=pytorch python examples/distributed/gcn_ns_ddp.py --gpu 0,1 --dataset reddit-self-loop --num-neighbors 10 --batch-size 30000 --test-batch-size 10000
class DGLNodeFlowLoader
It generates the same number of NodeFlows as torch.cuda.device_count() and gathers them into a list as the inputs. The labels are returned as the concatenation of the labels corresponding to all NodeFlows in the inputs.
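Conceptually, one loader iteration could be reproduced with DGL 0.3's NeighborSampler as in the sketch below; this is not the module's actual code, and the even split of the batch across devices is an assumption.

```python
# Conceptual sketch of one DGLNodeFlowLoader iteration (not the real implementation):
# sample torch.cuda.device_count() NodeFlows, collect them into a list, and
# concatenate the labels of their seed nodes.
import torch
from dgl.contrib.sampling import NeighborSampler

def iter_batches(g, labels, batch_size, num_neighbors, num_hops):
    n_dev = torch.cuda.device_count()
    sampler = NeighborSampler(g, batch_size // n_dev, num_neighbors,
                              neighbor_type='in', num_hops=num_hops, shuffle=True)
    nf_list, label_list = [], []
    for nf in sampler:
        nf_list.append(nf)
        # seed nodes sit in the last NodeFlow layer; map them back to parent-graph ids
        label_list.append(labels[nf.layer_parent_nid(-1)])
        if len(nf_list) == n_dev:
            yield nf_list, torch.cat(label_list)   # list of NodeFlows + concatenated labels
            nf_list, label_list = [], []
```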
class DGLGraphDataParallel
This class is modified from torch.nn.DataParallel. The input of the model should be a list of NodeFlows.
Each forward pass performs the following operations (similar to torch.nn.DataParallel):
- Scatter inputs and kwargs to all GPUs (by using NodeFlow.copy_from_parent())
- Replicate the module to all GPUs (same as torch)
- Parallel-apply the replicas on all GPUs (same as torch)
- Gather the forward results back to one GPU (same as torch)
Therefore, DGLGraphDataParallel transmits data (NodeFlows), weights, and forward results at every single forward pass. The backward pass (gradient computation and weight update) is applied on only one GPU.
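Roughly, those four steps can be pictured with the public torch.nn.parallel helpers as in the sketch below. It is a simplification, not the verbatim implementation, and the ctx argument of copy_from_parent is assumed here to place the NodeFlow data on the target device.

```python
# Simplified sketch of the four forward steps of DGLGraphDataParallel,
# written with the public torch.nn.parallel helpers.
import torch
from torch.nn.parallel import replicate, parallel_apply, gather

def data_parallel_forward(module, nodeflows, device_ids, output_device):
    used_devices = device_ids[:len(nodeflows)]
    # 1. scatter: copy each NodeFlow's features from the parent graph onto its GPU
    for nf, dev in zip(nodeflows, used_devices):
        nf.copy_from_parent(ctx=torch.device('cuda', dev))      # ctx argument assumed
    # 2. replicate the module onto every device that received a NodeFlow
    replicas = replicate(module, used_devices)
    # 3. run the replicas in parallel, one NodeFlow per replica
    outputs = parallel_apply(replicas, [(nf,) for nf in nodeflows], devices=used_devices)
    # 4. gather the per-GPU outputs back onto a single device
    return gather(outputs, output_device)
```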
We can also leverage the PyTorch NCCL backend (which transfers only gradients, in a ring all-reduce pattern) and the DGL graph store (which keeps the graph data in shared memory) to implement single-machine multi-GPU training. A demo can be found in examples/distributed/.
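For reference, the DDP variant follows the standard PyTorch recipe: one process per GPU, the NCCL backend for gradient all-reduce, and the graph attached from the shared-memory graph store. The sketch below shows that skeleton only; build_model and the init address are placeholders, the graph-store call follows the DGL 0.3 contrib API, and examples/distributed/gcn_ns_ddp.py has the actual code.

```python
# Skeleton of the DDP + graph-store setup (one process per GPU); build_model
# and the init_method address are placeholders, not the demo's exact values.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
import dgl

def run_worker(rank, world_size, dataset='reddit-self-loop'):
    # NCCL backend: only gradients cross GPUs, via ring all-reduce
    dist.init_process_group(backend='nccl',
                            init_method='tcp://127.0.0.1:23456',
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # attach to the shared-memory graph store started by run_graph_server.py
    g = dgl.contrib.graph_store.create_graph_from_store(dataset, "shared_mem")

    model = build_model().cuda(rank)                        # placeholder for a NodeFlow GCN
    model = DistributedDataParallel(model, device_ids=[rank])
    # ...sample NodeFlows from g and train; gradients are synchronized automatically
```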
All of the following results are measured on three GTX 1080 GPUs using the nn.DataParallel API and examples/gcn_ns_dp.py, with 2 GCN layers and 10 sampled neighbors. The batch size is 30000.
- Strong Scalability Test (each model replica's batch size = 30000)

  +------+---------------+
  | GPUs | Epoch Time(s) |
  +------+---------------+
  | 1    | 3.6018        |
  +------+---------------+
  | 2    | 4.8981        |
  +------+---------------+
  | 3    | 4.1862        |
  +------+---------------+
- Weak Scalability Test (each model replica's batch size = 30000 / GPUs)

  +------+---------------+
  | GPUs | Epoch Time(s) |
  +------+---------------+
  | 1    | 3.6018        |
  +------+---------------+
  | 2    | 3.9829        |
  +------+---------------+
  | 3    | 4.5653        |
  +------+---------------+