alibaba / graphlearn-for-pytorch

A GPU-accelerated graph learning library for PyTorch, facilitating the scaling of GNN training and inference.

index out of bounds for partition book List

kaixuanliu opened this issue · comments

πŸ› Describe the bug

I ran into this problem when running distributed training on the igbh-large dataset. I'm keeping a record here; if you meet the same problem or have solved it, please let me know~

task failed: index 40437200 is out of bounds for dimension 0 with size 4490
ERROR:root:coroutine task failed: index 43094749 is out of bounds for dimension 0 with size 4490
ERROR:root:coroutine task failed: index 36034227 is out of bounds for dimension 0 with size 4490
ERROR:root:coroutine task failed: index 41991547 is out of bounds for dimension 0 with size 4490
ERROR:root:coroutine task failed: index 44125491 is out of bounds for dimension 0 with size 4490
ERROR:root:coroutine task failed: index 31882725 is out of bounds for dimension 0 with size 4490

and the out-of-bounds error comes from here:

cmd lines:
node0:
python dist_train_rgnn.py --num_nodes=4 --node_rank=0 --num_training_procs=1 --master_addr=172.31.44.3 --model='rgat' --dataset_size='large' --num_classes=19
node1:
python dist_train_rgnn.py --num_nodes=4 --node_rank=1 --num_training_procs=1 --master_addr=172.31.44.3 --model='rgat' --dataset_size='large' --num_classes=19
node2:
python dist_train_rgnn.py --num_nodes=4 --node_rank=2 --num_training_procs=1 --master_addr=172.31.44.3 --model='rgat' --dataset_size='large' --num_classes=19
node3:
python dist_train_rgnn.py --num_nodes=4 --node_rank=3 --num_training_procs=1 --master_addr=172.31.44.3 --model='rgat' --dataset_size='large' --num_classes=19

Environment
GLT version: 0.2.0 (built from latest source code)
PyG version: 2.3.1
PyTorch version: 1.13.1+cpu
OS: Ubuntu 22.04.2 LTS
Python version: 3.8.16
CUDA/cuDNN version: N/A

It may be because the igbh-large dataset has two additional node types ('conference' and 'journal') which do not exist in igbh-tiny/small/medium, and we do not process them in dataset.py. I will try to fix it.

commented

@kaixuanliu How was the data partitioned? Partitioning the dataset on each of the four nodes independently may cause this problem, since there is randomness in the partitioning process. If the dataset was partitioned on each node independently in your experiment, try partitioning it on one node and copying the partitioned data to the rest.

I use NFS and partition the dataset only once.

commented

I use NFS and just partition the dataset once.

I see, using NFS should be fine.

But journal and conference nodes and relevant edges are covered for the large and full datasets in dataset.py.

I will try to reproduce this problem.

But journal and conference nodes and relevant edges are covered for the large and full datasets in dataset.py.

Yes, I checked this; that part is OK. And I have root-caused the bug. The problem is this: when one partition has no neighbors, the sampled neighbor output reuses the input seeds. In distributed training we then need the partition book of the sampled output, so we end up looking up the dst node partition book using src node global IDs, which causes the index-out-of-bounds error.
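To make the failure mode concrete, here is a minimal, hypothetical sketch (not GLT's actual API): a partition book is essentially an array indexed by a node type's global ID, so indexing the dst type's book with the src type's much larger global IDs overflows, matching the "index N is out of bounds for dimension 0 with size 4490" errors above.

```python
# Hypothetical illustration of the bug: partition_book maps each global
# node ID of ONE node type to the partition that owns it.
def lookup_partitions(partition_book, node_ids):
    # Raises IndexError when a node ID exceeds the book's length,
    # analogous to "index 40437200 is out of bounds ... with size 4490".
    return [partition_book[i] for i in node_ids]

# A small dst node type (e.g. only 4490 nodes of that type).
dst_partition_book = [0] * 4490
# Seeds are global IDs of a much larger src node type.
src_seed_ids = [40437200, 43094749]

try:
    # Bug: dst book indexed with src global IDs -> out of bounds.
    lookup_partitions(dst_partition_book, src_seed_ids)
except IndexError as e:
    print("reproduced:", e)
```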

Thanks for your feedback. I agree this is the problem. Will seek a solution.

It seems DGL handles this kind of situation with a different approach: dgl reference

Yes, we are considering using an empty tensor when sampling nothing.

It seems that simply using empty tensors fixes this, and no other modification is necessary in my environment. Would you like to try it first? I will push it after the holiday if no further problems arise.
Here

    if nbrs.numel() == 0:
      # nbrs, nbrs_num = input_seeds, torch.ones_like(input_seeds)
      # if self.with_edge:
      #   edge_ids = -1 * nbrs_num
      nbrs = torch.tensor([], dtype=torch.int64, device=self.device)
      nbrs_num = torch.zeros_like(input_seeds, dtype=torch.int64, device=self.device)
      edge_ids = torch.tensor([], dtype=torch.int64, device=self.device) if self.with_edge else None
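The key idea of the fix can be sketched without torch: returning an empty neighbor list (with per-seed neighbor counts of zero) instead of reusing the input seeds means downstream partition-book lookups receive no IDs at all, so the dst book is never indexed with out-of-range src IDs. This is only an illustration; `sample_fallback_empty` is a hypothetical name, not GLT's API.

```python
# Hypothetical sketch of the empty-neighbor convention from the patch above.
def sample_fallback_empty(input_seeds):
    nbrs = []                          # empty, instead of reusing input_seeds
    nbrs_num = [0] * len(input_seeds)  # zero sampled neighbors per seed
    return nbrs, nbrs_num

nbrs, nbrs_num = sample_fallback_empty([40437200, 43094749])
dst_partition_book = [0] * 4490
# No IDs to look up, so no out-of-bounds access is possible.
assert [dst_partition_book[i] for i in nbrs] == []
```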

And before Here
Add

  if output.nbr.size(0) > 0:

A few other minor changes are needed, and I have verified the fix for 2 epochs on the igbh-large dataset. A PR has been submitted. FYI.

Closed by #49