IllinoisGraphBenchmark / IGB-Datasets

Largest real-world open-source graph dataset. Work done under the IBM-Illinois Discovery Accelerator Institute and Amazon Research Awards, in collaboration with NVIDIA Research.

Home Page: https://arxiv.org/abs/2302.13522

cannot load npy files for igbh-large dataset.

kaixuanliu opened this issue · comments

Describe the bug
I can download the igbh-large dataset files now, but I cannot load them using numpy, even when I pass the allow_pickle=True argument.

Could you please help me figure out how to load these files? It works fine when I use numpy.load on the files from igbh-tiny.

Due to their large size, the IGB(H)-large node embedding files are saved in numpy memmapped format. Please look at the dataloader.py file, which handles loading the data. The function below loads the embeddings using np.memmap.

def paper_feat(self) -> np.ndarray:
    num_nodes = self.num_nodes()
    # TODO: temp for bafs. large and full special case
    if self.size == 'large' or self.size == 'full':
        path = osp.join(self.dir, 'full', 'processed', 'paper', 'node_feat.npy')
        emb = np.memmap(path, dtype='float32', mode='r', shape=(num_nodes, 1024))
    else:
        path = osp.join(self.dir, self.size, 'processed', 'paper', 'node_feat.npy')
        if self.synthetic:
            emb = np.random.rand(num_nodes, 1024).astype('f')
        else:
            if self.in_memory:
                emb = np.load(path)
            else:
                emb = np.load(path, mmap_mode='r')
    return emb

So in your case you would try something like the following (I didn't dry-run this code snippet):

import os.path as osp
import numpy as np

dir = "/lfs/lfs12/kaixuan/igb"
path = osp.join(dir, 'large', 'processed', 'paper', 'node_feat.npy')
num_nodes = 100000000  # number of paper nodes in IGBH-large
paper_feat = np.memmap(path, dtype='float32', mode='r', shape=(num_nodes, 1024))

Without memmap the size is too unwieldy to read. We are currently adding some documentation and we really appreciate these questions so we know what to add to it :)
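As a rough illustration of why this is workable (a small sketch continuing from the snippet above, not code from the repo): indexing the memmapped array only pages in the rows you actually touch, so you can inspect or batch a handful of embeddings without reading the whole file (roughly 400 GB of float32 for 100M x 1024).

# Continuing from paper_feat above: reads stay lazy until you index.
rows = paper_feat[:10]                 # a (10, 1024) view, still memory-mapped
batch = paper_feat[[0, 5, 999999]]     # fancy indexing copies just these rows into RAM
print(rows.shape, batch.dtype)         # (10, 1024) float32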

Please let us know if you have any more issues or suggestions (maybe a better alternative to memmap?).

paper_feat = np.memmap(path, dtype='float32', mode='r',  shape=(num_nodes,1024))

Cool, this works for me. I think this solution is good enough to deal with this kind of large numpy data-loading problem.

When running IGBH600, please note that np.memmap will be extremely slow. Training time will be dominated by the cost of page faults incurred by mmap.

The OS has some tricks to solve this problem, but they require building a library to expose the data as a tensor. If you end up writing something for this, we are open to a PR. If you have alternate ideas, let us know!
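If it helps, here is one possible direction (an untested sketch, not something in this repo): when the feature matrix does fit in DRAM, copy the memmap into a regular in-memory torch tensor chunk by chunk up front, so random accesses during sampling and training hit RAM instead of repeatedly faulting pages in from disk. The helper name and chunk size below are just examples.

import numpy as np
import torch

def memmap_to_tensor(path, num_nodes, dim=1024, chunk_rows=1000000):
    # Hypothetical helper: stream the memmapped node features into an
    # in-memory float32 tensor. Requires roughly num_nodes * dim * 4 bytes of RAM.
    src = np.memmap(path, dtype='float32', mode='r', shape=(num_nodes, dim))
    out = torch.empty((num_nodes, dim), dtype=torch.float32)
    out_np = out.numpy()  # shares memory with the tensor, so the copy fills it in place
    for start in range(0, num_nodes, chunk_rows):
        end = min(start + chunk_rows, num_nodes)
        out_np[start:end] = src[start:end]  # one large sequential read per chunk
    return out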

@akhatua2
I have been running IGB-Large, and in my case I have sufficient DRAM (~750 GB), so I modified the dataloader to use np.load instead of memmap. But these pickling-related errors are being thrown:

  • allow_pickle = True
Traceback (most recent call last):                                                                                                                                       
  File "/storage/utk/miniconda3/envs/dgl/lib/python3.11/site-packages/numpy/lib/npyio.py", line 465, in load                                                             
    return pickle.load(fid, **pickle_kwargs)                                                                                                                             
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                             
_pickle.UnpicklingError: invalid load key, '\xd8'.                                                                                                                       
                                                                                                                                                                         
The above exception was the direct cause of the following exception:                                                                                                     
                                                                                                                                                                         
Traceback (most recent call last):                                                                                                                                       
  File "/disk/IGB-Datasets/igb/train_single_gpu.py", line 197, in <module>                                                                                               
    dataset = IGB260MDGLDataset(args)                                                                                                                                    
              ^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                    
  File "/disk/IGB-Datasets/igb/dataloader.py", line 103, in __init__                                                                                                     
    super().__init__(name='IGB260MDGLDataset')                                                                                                                           
  File "/storage/utk/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl-1.1.1-py3.11-linux-x86_64.egg/dgl/data/dgl_dataset.py", line 112, in __init__                  
    self._load()                                                                                                                                                         
  File "/storage/utk/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl-1.1.1-py3.11-linux-x86_64.egg/dgl/data/dgl_dataset.py", line 203, in _load                     
    self.process()                                                                                                                                                       
  File "/disk/IGB-Datasets/igb/dataloader.py", line 109, in process                                                                                                      
    node_features = torch.from_numpy(dataset.paper_feat)                                                                                                                 
                                     ^^^^^^^^^^^^^^^^^^                                                                                                                  
  File "/disk/IGB-Datasets/igb/dataloader.py", line 39, in paper_feat                                                                                                    
    emb = np.load(path, allow_pickle=True)                                                                                                                               
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                               
  File "/storage/utk/miniconda3/envs/dgl/lib/python3.11/site-packages/numpy/lib/npyio.py", line 467, in load                                                             
    raise pickle.UnpicklingError(                                                                                                                                        
_pickle.UnpicklingError: Failed to interpret file '/disk/igb/igb_large/full/processed/paper/node_feat.npy' as a pickle 
  • allow_pickle = False
Traceback (most recent call last):
  File "/disk/IGB-Datasets/igb/train_single_gpu.py", line 197, in <module>
    dataset = IGB260MDGLDataset(args)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/IGB-Datasets/igb/dataloader.py", line 103, in __init__
    super().__init__(name='IGB260MDGLDataset')
  File "/storage/utk/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl-1.1.1-py3.11-linux-x86_64.egg/dgl/data/dgl_dataset.py", line 112, in __init__
    self._load()
  File "/storage/utk/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl-1.1.1-py3.11-linux-x86_64.egg/dgl/data/dgl_dataset.py", line 203, in _load
    self.process()
  File "/disk/IGB-Datasets/igb/dataloader.py", line 109, in process
    node_features = torch.from_numpy(dataset.paper_feat)
                                     ^^^^^^^^^^^^^^^^^^
  File "/disk/IGB-Datasets/igb/dataloader.py", line 39, in paper_feat
    emb = np.load(path)
          ^^^^^^^^^^^^^
  File "/storage/utk/miniconda3/envs/dgl/lib/python3.11/site-packages/numpy/lib/npyio.py", line 462, in load
    raise ValueError("Cannot load file containing pickled data "
ValueError: Cannot load file containing pickled data when allow_pickle=False

I see you mentioned the .npy files are stored in a special way; can you suggest how to mitigate the above issue?

Hey @UtkrishtP, the way to do it would be to read the files with np.memmap as shown in the dataloader, then np.save the array to your disk. From there you can np.load it into memory. Let me know if that makes sense.
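Concretely, something along these lines should work (an untested sketch; the source path is taken from your traceback, and the output filename is just an example). The np.load failure above suggests the shipped file is a raw float32 dump without the standard .npy header, which is why numpy falls back to (and then fails at) unpickling; re-saving it once with np.save produces a normal .npy file that a plain np.load can read straight into DRAM.

import numpy as np

src = '/disk/igb/igb_large/full/processed/paper/node_feat.npy'        # path from your traceback
dst = '/disk/igb/igb_large/full/processed/paper/node_feat_plain.npy'  # example output name

num_nodes = 100000000  # paper nodes in IGBH-large, per the earlier snippet
emb = np.memmap(src, dtype='float32', mode='r', shape=(num_nodes, 1024))

# Write the memmap back out as a standard .npy file (header + data).
# Note: this writes another ~400 GB to disk.
np.save(dst, emb)

# Later runs can load it directly; needs ~400 GB of free memory for float32.
node_feat = np.load(dst)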

We don't use the non-memmap files due to the extremely long read/write times.