SUREL+: Moving from Walks to Sets for Scalable Subgraph-based Graph Representation Learning

SUREL+ is a novel set-based computation framework for scaling subgraph-based GRL to industry-level graphs. It is the first time that SGRL has been successfully deployed on billion-edge graphs. SUREL+ breaks costly subgraph extraction into sampling multiple node sets, where their joint set can act as a proxy of query-induced subgraphs for predictions of multiple queries. For more details, please refer to our paper SUREL+: Moving from Walks to Sets for Scalable Subgraph-based Graph Representation Learning, to appear in VLDB'23.

SUREL+ benefits from the reusability of sampled node sets across different queries (e.g., link, motif), and its set form substantially reduces both memory and computational cost by eliminating heavy node duplication from walk-based sampling. SUREL+ designs dedicated sparse storage SpG and a sparse join operator SpJointo handle the irregular-sized node sets. It adopts a modular design, where users can flexibly choose different set samplers, structure encoders and set neural encoders to suit their own SGRL tasks.

Currently, SUREL+ framework supports the following:

Large-scale graph ML tasks: link prediction / relation-type prediction / higher-order pattern prediction
Preprocessing and training on nine datasets in Open Graph Benchmark (OGB) format
Flexible modules:
- Set Samplers: walk-based, metric-based (PPR)
- Structure Encoders: Landing Probabilities, Shortest Path Distances, PPR scores
- Neural Set Encoders: MLP+mean pooling, LSTM-based, Attention-based
Single GPU training and evaluation
Structural Features + Node Features

Updates

May 15, 2023:

support SubGAcc v2.3 for billion-edge graphs
add two industry-level graph benchmarks
- criteo-click with 16.5M records of online banner ads clicking
- twitter-follower with 1.5B following relations of users

Mar. 1, 2023:

support SubGAcc v2.2 and ogbl-vessel v1.1
better logger function
add model checkpoint and inference-only mode

Requirements

(Other versions may work but are untested)

Ubuntu 20.04
CUDA >= 11.3
python >= 3.8
1.11.0 <= pytorch <= 1.12.0

SGRL Environment Setup

Requirements: Python >= 3.8, Anaconda3

Update conda:

conda update -n base -c defaults conda

Install basic dependencies to the virtual environment and activate it:

conda env create -f environment.yml
conda activate sgrl-env

Update: SUREL now supports PyTorch 1.12.1 and PyG 2.2.0 with pyg-lib. To install them, simply run

conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
conda install pyg -c pyg
pip install pyg-lib -f https://data.pyg.org/whl/torch-1.12.0+cu113.html

For more details, please refer to the PyTorch, PyTorch Geometric and pyg-lib. The code of this repository is lately tested with Python 3.10.9 + PyTorch 1.12.1 (CUDA 11.3) + torch-geometric 2.2.0.

Quick Start

Install the required version of PyTorch that is compatible with your CUDA driver
Clone the repository git clone https://github.com/Graph-COM/SUREL_Plus.git
Build and install the SubGAcc library cd subg_acc;python3 setup.py install

To train SUREL+ for link prediction on collab with LP/LSTM:

python main.py --dataset ogbl-collab --metric Hits --sencoder LP --num_steps 3 --num_walks 200 --aggr lstm --use_val

To train SUREL+ for link prediction on ppa with LP/Attn:

python main.py --dataset ogbl-ppa --metric Hits --sencoder LP --num_steps 4 --num_walks 200 --k 20 --aggr attn

To train SUREL+ for link prediction on citation2 with PPR/Mean:

python main.py --dataset ogbl-citation2 --metric MRR --sencoder PPR --topk 100  --aggr mean

To train SUREL+ for vessel prediction with LP/Mean:

python main.py --dataset ogbl-vessel --metric AUC --sencoder LP --num_steps 2 --num_walks 50 --k 5 --aggr mean --use_raw --dropout 0.2

To train SUREL+ for relation type prediction on MAG(A-P):

python main.py --dataset mag --relation write --metric MRR --sencoder LP --num_steps 3 --num_walks 100

To train SUREL+ for higher-order pattern prediction on DBLP:

python main_horder.py --dataset DBLP-coauthor --metric MRR --num_steps 3 --num_walks 100

All detailed training logs can be found at <log_dir>/<dataset>/<training-timestamp>.log, and the saved checkpoint is under the model subfolder.

Usage

usage: main.py [-h] [--device DEVICE] [--log_steps LOG_STEPS] [--num_layers NUM_LAYERS]
               [--hidden_channels HIDDEN_CHANNELS] [--dropout DROPOUT]
               [--batch_size BATCH_SIZE] [--lr LR] [--train_ratio TRAIN_RATIO]
               [--valid_perc VALID_PERC] [--epochs EPOCHS] [--eval_steps EVAL_STEPS]
               [--early_stop EARLY_STOP] [--runs RUNS] [--seed SEED] [--alpha ALPHA]
               [--eps EPS] [--topk TOPK] [--num_walks NUM_WALKS]
               [--num_steps NUM_STEPS] [--k K] [--nthread NTHREAD]
               [--dataset {ogbl-ppa,ogbl-ddi,ogbl-citation2,ogbl-collab,ogbl-vessel,mag}]
               [--relation {write,cite}] [--metric {AUC,MRR,Hits}]
               [--aggrs {mean,lstm,attn}] [--sencoder {LP,PPR,SPD,DEG}] [--use_raw]
               [--use_weight] [--use_val] [--use_pretrain] [--load_ppr] [--save_ppr]
               [--inf_only] [--log_dir LOG_DIR] [--load_model LOAD_MODEL] [--debug]

Optional Arguments

optional arguments:
  -h, --help            show this help message and exit
  --device DEVICE
  --log_steps LOG_STEPS
  --num_layers NUM_LAYERS
  --hidden_channels HIDDEN_CHANNELS
  --dropout DROPOUT
  --batch_size BATCH_SIZE
  --lr LR
  --train_ratio TRAIN_RATIO
  --valid_perc VALID_PERC
  --epochs EPOCHS
  --eval_steps EVAL_STEPS
  --early_stop EARLY_STOP
  --runs RUNS
  --seed SEED           seed to initialize all the random modules
  --alpha ALPHA         teleport probability in PPR
  --eps EPS             precision of PPR approx
  --topk TOPK           sample size of node set
  --num_walks NUM_WALKS
                        number of walks
  --num_steps NUM_STEPS
                        step of walks
  --k K                 negative samples
  --nthread NTHREAD     number of threads
  --dataset {ogbl-ppa,ogbl-ddi,ogbl-citation2,ogbl-collab,ogbl-vessel,mag}
                        dataset name
  --relation {write,cite}
                        relation type
  --metric {AUC,MRR,Hits}
                        metric for evaluating performance
  --aggrs {mean,lstm,attn}
                        type of set neural encoder
  --sencoder {LP,PPR,SPD,DEG}
                        type of structure encoder
  --use_raw             whether to use raw features
  --use_weight          whether to use edge weight
  --use_val             whether to use validation as input
  --use_pretrain        whether to load pretrained embedding
  --load_ppr            whether to load precomputed ppr
  --save_ppr            whether to save calculated ppr
  --inf_only            whether to perform inference only
  --log_dir LOG_DIR     log directory
  --load_model LOAD_MODEL
                        saved model path
  --debug               whether to use debug mode

Citation

Please cite our paper if you are interested in our work.

@article{yin2023surel+,
  title={SUREL+: Moving from Walks to Sets for Scalable Subgraph-based Graph Representation Learning},
  author={Yin, Haoteng and Zhang, Muhan and Wang, Jianguo and Li, Pan},
  journal={Proceedings of the VLDB Endowment},
  volume={16},
  number={11},
  pages={2939-2948},
  year={2023}
}

Graph-COM / SUREL_Plus