Sampling for Heterogeneous GNNs
Abstract
Graph sampling is a popular technique for training large-scale graph neural networks (GNNs), and recent sampling-based methods have demonstrated impressive success on homogeneous graphs. In practice, however, interactions between entities differ according to their relationship, i.e., real-world networks are mostly heterogeneous. Only a few recent works have paid attention to sampling methods on heterogeneous graphs. In this work, we study sampling for heterogeneous GNNs. We propose two general pipelines for heterogeneous sampling. Based on the proposed pipelines, we evaluate three representative sampling methods on heterogeneous graphs: node-wise sampling, layer-wise sampling, and subgraph-wise sampling. To the best of our knowledge, we are the first to provide a thorough implementation, evaluation, and discussion of each sampling method on heterogeneous graphs. Extensive experiments compare the sampling methods from multiple aspects and highlight the characteristics of each category. An evaluation of scalability on larger-scale heterogeneous graphs also shows that we achieve a trade-off between efficiency and effectiveness. Finally, we analyze the limitations of our proposed pipeline on heterogeneous subgraph sampling and provide a detailed comparison with HGSampling.
Requirements
- PyTorch 1.0+
- requests
- rdflib
pip install dgl-cu101 dglgo -f https://data.dgl.ai/wheels/repo.html
pip install requests torch rdflib pandas
Example code was tested with rdflib 4.2.2 and pandas 0.23.4.
Datasets
The preprocessing is slightly different from the author's code: we load and preprocess the raw RDF data directly. For AIFB, BGS and AM, all literal nodes are pruned from the graph. For AIFB, some training/testing nodes become orphaned as a result and are excluded from the training/testing set. The resulting graph has fewer entities and relations. For reference (numbers include reverse edges and relations):
Dataset | #Nodes | #Edges | #Relations | #Labeled |
---|---|---|---|---|
AIFB | 8,285 | 58,086 | 90 | 176 |
AIFB-hetero | 7,262 | 48,810 | 78 | 176 |
MUTAG | 23,644 | 148,454 | 46 | 340 |
MUTAG-hetero | 27,163 | 148,100 | 46 | 340 |
BGS | 333,845 | 1,832,398 | 206 | 146 |
BGS-hetero | 94,806 | 672,884 | 96 | 146 |
AM | 1,666,764 | 11,976,642 | 266 | 1,000 |
AM-hetero | 881,680 | 5,668,682 | 96 | 1,000 |
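The preprocessing described above (pruning literal nodes, adding reverse edges, and dropping labeled nodes that end up orphaned) can be illustrated with a minimal sketch. It is not the repository's preprocessing code, which parses raw RDF with rdflib. The `Triple` type, the `rev-` relation prefix, and the toy data are assumptions made for illustration only.

```python
from collections import namedtuple

# Hypothetical triple representation (assumption, not the repository's data model).
Triple = namedtuple("Triple", ["subj", "rel", "obj", "obj_is_literal"])


def prune_literals(triples):
    """Drop every triple whose object is a literal value."""
    return [t for t in triples if not t.obj_is_literal]


def add_reverse_edges(triples):
    """Return (src, relation, dst) edges plus a reversed copy of each edge.

    The reversed copy is stored under a 'rev-' relation name (assumed naming).
    """
    edges = []
    for t in triples:
        edges.append((t.subj, t.rel, t.obj))
        edges.append((t.obj, "rev-" + t.rel, t.subj))
    return edges


def orphaned(labeled_nodes, edges):
    """Labeled nodes that no longer touch any edge after pruning."""
    connected = {n for s, _, d in edges for n in (s, d)}
    return sorted(n for n in labeled_nodes if n not in connected)


toy = [
    Triple("person1", "worksAt", "group1", False),
    Triple("person1", "name", "Alice", True),   # literal object, pruned
    Triple("person2", "name", "Bob", True),     # literal object, pruned
]
edges = add_reverse_edges(prune_literals(toy))
dropped = orphaned(["person1", "person2"], edges)
print(edges)
print(dropped)
```

In this toy input, `person2` only appears in a literal triple, so it loses all its edges once literals are pruned and would be excluded from the labeled set.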
To evaluate the scalability of sampling methods on larger-scale heterogeneous graphs, we also include the OGBN-MAG dataset, which contains four types of entities: papers (736,389 nodes), authors (1,134,649 nodes), institutions (8,740 nodes), and fields of study (59,965 nodes).
Demo
Check demo or Google Colab for more results.
Usage
Please put sampling code at
dgl/examples/pytorch/rgcn-hetero/
For node-wise sampling:
python NodeSampler.py -d aifb --testing --gpu 0 --fanout=8
python NodeSampler.py -d mutag --l2norm 5e-4 --testing --gpu 0 --fanout=8
python NodeSampler.py -d bgs --l2norm 5e-4 --n-bases 40 --testing --gpu 0
python NodeSampler.py -d am --l2norm 5e-4 --n-bases 40 --testing --gpu 0 --fanout=16 --batch-size 50
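As a rough illustration of node-wise sampling on a heterogeneous graph, the sketch below draws at most `fanout` in-neighbors per seed node and per relation, layer by layer. It is plain Python rather than the code in `NodeSampler.py`. The graph layout (`{relation: {dst: [src, ...]}}`), the function names, and the reading of `fanout` as a per-relation neighbor budget are assumptions.

```python
import random


def sample_neighbors(graph, seeds, fanout, rng):
    """Sample up to `fanout` in-neighbors of each seed, separately per relation."""
    sampled = []
    for rel, in_nbrs in graph.items():
        for dst in seeds:
            nbrs = in_nbrs.get(dst, [])
            picked = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
            sampled.extend((src, rel, dst) for src in picked)
    return sampled


def node_wise_sample(graph, seeds, fanout, num_layers, seed=0):
    """Return one edge list per GNN layer, ordered from input side to output side."""
    rng = random.Random(seed)
    blocks = []
    frontier = list(seeds)
    for _ in range(num_layers):
        edges = sample_neighbors(graph, frontier, fanout, rng)
        blocks.insert(0, edges)
        # The next (deeper) layer needs inputs for the seeds and their sampled neighbors.
        frontier = sorted(set(frontier) | {src for src, _, _ in edges})
    return blocks


# Toy graph: authors (a*) write papers (p*), papers cite papers.
toy_graph = {
    "writes": {"p0": ["a0", "a1", "a2", "a3"], "p1": ["a0"]},
    "cites": {"p0": ["p1", "p2"]},
}
blocks = node_wise_sample(toy_graph, ["p0"], fanout=2, num_layers=2)
for depth, edges in enumerate(blocks):
    print(depth, len(edges))
```

Because each node samples independently, the number of nodes needed can grow with every layer, which is the cost that layer-wise and subgraph-wise methods try to control.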
For layer-wise sampling:
python LayerSampler.py -d aifb --testing --gpu 0 --fanout=8
python LayerSampler.py -d mutag --l2norm 5e-4 --testing --gpu 0 --fanout=8
python LayerSampler.py -d bgs --l2norm 5e-4 --n-bases 40 --testing --gpu 0
python LayerSampler.py -d am --l2norm 5e-4 --n-bases 40 --testing --gpu 0 --fanout=16 --batch-size 50
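For contrast with node-wise sampling, the following sketch caps the number of source nodes kept per layer (here, per relation) instead of per seed node. It is an illustrative simplification, not the code in `LayerSampler.py`: candidates are drawn uniformly, whereas published layer-wise samplers typically use importance weights, and the `layer_size` parameter and graph layout are assumptions.

```python
import random


def layer_wise_sample(graph, seeds, layer_size, num_layers, seed=0):
    """graph: {relation: {dst: [src, ...]}}. Returns one edge list per layer."""
    rng = random.Random(seed)
    layers = []
    frontier = set(seeds)
    for _ in range(num_layers):
        kept_edges = []
        next_frontier = set(frontier)
        for rel, in_nbrs in graph.items():
            # Pool the in-neighbors of the whole frontier for this relation.
            candidates = sorted({s for d in frontier for s in in_nbrs.get(d, [])})
            if len(candidates) > layer_size:
                chosen = set(rng.sample(candidates, layer_size))
            else:
                chosen = set(candidates)
            # Keep only edges from chosen sources into the current frontier.
            kept_edges.extend(
                (s, rel, d)
                for d in sorted(frontier)
                for s in in_nbrs.get(d, [])
                if s in chosen
            )
            next_frontier |= chosen
        layers.insert(0, kept_edges)
        frontier = next_frontier
    return layers


toy_graph = {"writes": {"p0": ["a0", "a1", "a2"], "p1": ["a1", "a2", "a3"]}}
layers = layer_wise_sample(toy_graph, ["p0", "p1"], layer_size=2, num_layers=1)
sources = {s for s, _, _ in layers[0]}
print(sorted(sources))
```

The layer budget bounds the number of distinct sources regardless of how many seeds there are, at the price of some seeds possibly receiving few or no sampled neighbors.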
For subgraph-wise sampling (ShaDowKHopSampler):
python ShaDowKHopSampler.py -d aifb --testing --gpu 0 --fanout=8
python ShaDowKHopSampler.py -d mutag --l2norm 5e-4 --testing --gpu 0 --fanout=8
python ShaDowKHopSampler.py -d bgs --l2norm 5e-4 --n-bases 40 --testing --gpu 0
python ShaDowKHopSampler.py -d am --l2norm 5e-4 --n-bases 40 --testing --gpu 0 --fanout=16 --batch-size 50
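The idea behind ShaDow-style subgraph sampling can be sketched as follows: expand a sampled k-hop neighborhood around the seeds, then return the subgraph induced on all visited nodes, so that every GNN layer runs on the same small subgraph. This is a plain-Python illustration under assumed names and graph layout (`{relation: {dst: [src, ...]}}`), not the code in `ShaDowKHopSampler.py`.

```python
import random


def shadow_khop_subgraph(graph, seeds, fanout, num_hops, seed=0):
    """Return (nodes, induced_edges) for a sampled k-hop neighborhood of `seeds`."""
    rng = random.Random(seed)
    nodes = set(seeds)
    frontier = set(seeds)
    for _ in range(num_hops):
        reached = set()
        for in_nbrs in graph.values():
            for d in sorted(frontier):
                nbrs = in_nbrs.get(d, [])
                picked = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
                reached.update(picked)
        frontier = reached - nodes   # only expand nodes not seen before
        nodes |= reached
    # Induce the subgraph: keep every edge whose endpoints were both visited,
    # including edges that were never traversed during sampling.
    induced = [
        (s, rel, d)
        for rel, in_nbrs in graph.items()
        for d, srcs in in_nbrs.items()
        for s in srcs
        if s in nodes and d in nodes
    ]
    return nodes, induced


toy_graph = {
    "cites": {"p0": ["p1"], "p1": ["p2"], "p2": ["p0"]},
    "writes": {"p0": ["a0"]},
}
nodes, induced = shadow_khop_subgraph(toy_graph, ["p0"], fanout=2, num_hops=2)
print(sorted(nodes))
print(sorted(induced))
```

In the toy graph, the edge from `p0` to `p2` is never followed while expanding from `p0`, but it appears in the result because both endpoints are in the sampled node set.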
For subgraph-wise sampling (ClusterGCNSampler):
See method2_cluster-gcn.
Acknowledgements
We would like to thank Yewen (Emily) Wang, Prof. Yizhou Sun, and the DGL community for helpful discussions and comments.