qyliu-hkust / light-dist-gnn

A lightweight distributed GNN library for full batch node property prediction.

Features/Changelog

Complete refactoring of CAGNET.
Distributed utilities such as log, timer, etc.
Node feature cached training.
Partitioned graph cache on disk.
More datasets: most large graphs from PyG, DGL, and OGB are supported.
Training depends on PyTorch only.
Distributed GAT training.
Latest PyTorch version supported.
CSR graph supported.
Half precision training supported.
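
To illustrate what the CSR graph feature refers to, here is a minimal pure-Python sketch of the CSR layout and the neighbor aggregation (an unweighted SpMM) that full-batch GNN training is built around. The function names `coo_to_csr` and `aggregate_sum` are hypothetical and are not part of this library, which operates on PyTorch tensors; the sketch only shows the data structure.

```python
from typing import List, Sequence, Tuple


def coo_to_csr(num_nodes: int,
               edges: Sequence[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Convert (src, dst) edge pairs into CSR arrays (indptr, indices).

    Hypothetical helper for illustration only. Rows are indexed by src:
    the neighbors of node i are indices[indptr[i]:indptr[i + 1]].
    """
    # Count the out-degree of every node, shifted by one position.
    indptr = [0] * (num_nodes + 1)
    for src, _ in edges:
        indptr[src + 1] += 1
    # A prefix sum turns the degree counts into row offsets.
    for i in range(num_nodes):
        indptr[i + 1] += indptr[i]
    # Scatter each destination into the next free slot of its row.
    indices = [0] * len(edges)
    next_slot = indptr[:-1]  # slicing copies, so indptr is not modified
    for src, dst in edges:
        indices[next_slot[src]] = dst
        next_slot[src] += 1
    return indptr, indices


def aggregate_sum(indptr: List[int], indices: List[int],
                  features: List[List[float]]) -> List[List[float]]:
    """Sum the feature vectors of each node's neighbors (unweighted SpMM)."""
    dim = len(features[0])
    out = []
    for row in range(len(indptr) - 1):
        acc = [0.0] * dim
        for k in range(indptr[row], indptr[row + 1]):
            neighbor = features[indices[k]]
            for j in range(dim):
                acc[j] += neighbor[j]
        out.append(acc)
    return out


# Toy graph with 3 nodes and 4 directed edges.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
indptr, indices = coo_to_csr(3, edges)
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
aggregated = aggregate_sum(indptr, indices, features)
print(indptr, indices, aggregated)
```

CSR stores one offset per node plus one entry per edge, so a node's neighbors sit in a contiguous slice. That makes row-wise aggregation cheap.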

Getting started

Set up a clean environment.

conda create --name gnn
conda activate gnn

Install PyTorch (needed for training) and other libraries (needed for downloading datasets).

# CUDA 10:
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch-lts
conda install -c dglteam dgl-cuda10.2
conda install pyg -c pyg -c conda-forge
pip install ogb

# CUDA 11:
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch-lts -c nvidia
conda install -c dglteam dgl-cuda11.1
conda install pyg -c pyg -c conda-forge
pip install ogb
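
After installing, a quick check like the one below can confirm which packages are importable in the active environment. This is a sketch rather than part of the repository: the helper `missing_packages` is hypothetical, and the import names (`torch`, `dgl`, `torch_geometric`, `ogb`) are the standard module names of the packages installed above.

```python
import importlib.util
from typing import Iterable, List

# torch is needed for training; the others are only needed to download datasets.
TRAINING_PACKAGES = ["torch"]
DATASET_PACKAGES = ["dgl", "torch_geometric", "ogb"]


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing_training = missing_packages(TRAINING_PACKAGES)
missing_datasets = missing_packages(DATASET_PACKAGES)

if missing_training:
    print("Missing packages required for training:", ", ".join(missing_training))
if missing_datasets:
    print("Missing packages used for dataset download:", ", ".join(missing_datasets))
if not missing_training and not missing_datasets:
    print("All packages found.")
```

The check only looks the modules up without importing them, so it runs quickly and does not initialize CUDA.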

Compile and install spmm (optional; requires a CUDA development environment).

cd spmm_cpp
python setup.py install

Prepare datasets (edit the code according to your needs).

# This may take a while.
python prepare_data.py

Train.

python main.py

Experiments for Sancus: Staleness-Aware Communication-Avoiding Full-Graph Decentralized Training in Large-Scale Graph Neural Networks

Check the steps in Getting started.
Check the dataset, number of epochs, and number of GPUs in main.py.
Check the model settings in dist_train.py.
Check cache methods in models.
Run and see the result.

Contact

Contact chenzhao@ust.hk for any problems.
