alibaba / graphlearn-for-pytorch

A GPU-accelerated graph learning library for PyTorch, facilitating the scaling of GNN training and inference.

index out of bounds for partition book List

kaixuanliu opened this issue · comments

πŸ› Describe the bug

I ran into this problem when running distributed training on the igbh-large dataset. I'm keeping a record here; if you meet the same problem or have solved it, please let me know~

task failed: index 40437200 is out of bounds for dimension 0 with size 4490
ERROR:root:coroutine task failed: index 43094749 is out of bounds for dimension 0 with size 4490
ERROR:root:coroutine task failed: index 36034227 is out of bounds for dimension 0 with size 4490
ERROR:root:coroutine task failed: index 41991547 is out of bounds for dimension 0 with size 4490
ERROR:root:coroutine task failed: index 44125491 is out of bounds for dimension 0 with size 4490
ERROR:root:coroutine task failed: index 31882725 is out of bounds for dimension 0 with size 4490

and the out-of-bounds error comes from here:

cmd lines:
node0:
python dist_train_rgnn.py --num_nodes=4 --node_rank=0 --num_training_procs=1 --master_addr=172.31.44.3 --model='rgat' --dataset_size='large' --num_classes=19
node1:
python dist_train_rgnn.py --num_nodes=4 --node_rank=1 --num_training_procs=1 --master_addr=172.31.44.3 --model='rgat' --dataset_size='large' --num_classes=19
node2:
python dist_train_rgnn.py --num_nodes=4 --node_rank=2 --num_training_procs=1 --master_addr=172.31.44.3 --model='rgat' --dataset_size='large' --num_classes=19
node3:
python dist_train_rgnn.py --num_nodes=4 --node_rank=3 --num_training_procs=1 --master_addr=172.31.44.3 --model='rgat' --dataset_size='large' --num_classes=19

Environment
GLT version: 0.2.0 (built from latest source code)
PyG version: 2.3.1
PyTorch version: 1.13.1+cpu
OS: Ubuntu 22.04.2 LTS
Python version: 3.8.16
CUDA/cuDNN version: N/A

It may be because the igbh-large dataset has two additional node types ('conference' and 'journal') which do not exist in igbh-tiny/small/medium, and we do not process them in dataset.py. I will try to fix it.

commented

@kaixuanliu How was the data partitioned? Partitioning the dataset on each of the four nodes independently may cause this problem, since there is randomness in the partitioning process. If the dataset was partitioned on each node independently in your experiment, try partitioning it on one node and copying the partitioned data to the rest.

I use NFS and partition the dataset only once.

commented

I use NFS and just partition the dataset once.

I see, using NFS should be fine.

But journal and conference nodes and relevant edges are covered for the large and full datasets in dataset.py.

I will try to reproduce this problem.

But journal and conference nodes and relevant edges are covered for the large and full datasets in dataset.py.

Yes, I checked this; that part is OK. And I have root-caused the bug. The problem is this: when one partition has no neighbors, the sampled neighbor output reuses the input seeds. In distributed training we then need the partition book of the sampled output, so we end up looking up the dst node partition book using src node global IDs, which causes the index-out-of-bounds error.
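To make the failure mode concrete, here is a minimal, hypothetical sketch (not GLT's actual API): a partition book is essentially an array indexed by a node type's global ID, so indexing the dst type's book with the src type's much larger global IDs overflows, matching the "index N is out of bounds for dimension 0 with size 4490" errors above.

```python
# Hypothetical illustration of the bug: partition_book maps each global
# node ID of ONE node type to the partition that owns it.
def lookup_partitions(partition_book, node_ids):
    # Raises IndexError when a node ID exceeds the book's length,
    # analogous to "index 40437200 is out of bounds ... with size 4490".
    return [partition_book[i] for i in node_ids]

# A small dst node type (e.g. only 4490 nodes of that type).
dst_partition_book = [0] * 4490
# Seeds are global IDs of a much larger src node type.
src_seed_ids = [40437200, 43094749]

try:
    # Bug: dst book indexed with src global IDs -> out of bounds.
    lookup_partitions(dst_partition_book, src_seed_ids)
except IndexError as e:
    print("reproduced:", e)
```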

Thanks for your feedback. I agree this is the problem. Will seek a solution.

It seems DGL handles this kind of situation with a different approach: dgl reference

Yes, we are considering using an empty tensor when sampling nothing.

It seems that simply using empty tensors fixes this, and no other modification is necessary in my environment. Would you like to try it first? I will push it after the holiday if no further problems arise.
Here

    if nbrs.numel() == 0:
      # nbrs, nbrs_num = input_seeds, torch.ones_like(input_seeds)
      # if self.with_edge:
      #   edge_ids = -1 * nbrs_num
      nbrs = torch.tensor([], dtype=torch.int64, device=self.device)
      nbrs_num = torch.zeros_like(input_seeds, dtype=torch.int64, device=self.device)
      edge_ids = torch.tensor([], dtype=torch.int64, device=self.device) if self.with_edge else None
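The key idea of the fix can be sketched without torch: returning an empty neighbor list (with per-seed neighbor counts of zero) instead of reusing the input seeds means downstream partition-book lookups receive no IDs at all, so the dst book is never indexed with out-of-range src IDs. This is only an illustration; `sample_fallback_empty` is a hypothetical name, not GLT's API.

```python
# Hypothetical sketch of the empty-neighbor convention from the patch above.
def sample_fallback_empty(input_seeds):
    nbrs = []                          # empty, instead of reusing input_seeds
    nbrs_num = [0] * len(input_seeds)  # zero sampled neighbors per seed
    return nbrs, nbrs_num

nbrs, nbrs_num = sample_fallback_empty([40437200, 43094749])
dst_partition_book = [0] * 4490
# No IDs to look up, so no out-of-bounds access is possible.
assert [dst_partition_book[i] for i in nbrs] == []
```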

And before Here
Add

  if output.nbr.size(0) > 0:

A few other minor changes are needed, and I have verified the fix for 2 epochs on the igbh-large dataset. A PR has been submitted. FYI.

Closed by #49