Cannot load .npy files for IGBH-large dataset
kaixuanliu opened this issue
Due to their large size, the IGB(H)-large node embedding files are saved in NumPy memmapped format. Please look at the dataloader.py file, which handles loading the data. This function here loads the embeddings using np.memmap:
IGB-Datasets/igb/dataloader.py
Lines 33 to 49 in 1f4628f
So in your case you would try something like this (I didn't dry run this code snippet):

```python
import os.path as osp
import numpy as np

dir = "/lfs/lfs12/kaixuan/igb"
path = osp.join(dir, 'large', 'processed', 'paper', 'node_feat.npy')
num_nodes = 100000000  # number of paper nodes in IGBH-large
paper_feat = np.memmap(path, dtype='float32', mode='r', shape=(num_nodes, 1024))
```
Without memmap the files are too unwieldy to read. We are currently adding some documentation, and we really appreciate these questions so we know what to add to it :)
Please let us know if you have any more issues or suggestions (maybe a better alternative to memmap?).
> paper_feat = np.memmap(path, dtype='float32', mode='r', shape=(num_nodes, 1024))

Cool, this works for me. I think this solution is good enough to deal with this kind of large NumPy data-loading problem.
When running IGBH600, please note that np.memmap will be extremely slow: training time will be dominated by the cost of page faults from mmap.
The OS has some tricks to mitigate this, but they require building a library to expose the data as a tensor. If you end up writing something for this, we are open to a PR. If you have alternate ideas, let us know!
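If the features do fit in DRAM, one simple workaround for the page-fault stalls is to materialize the memmap into an ordinary in-RAM array once, up front, before training. This is a minimal sketch, not part of the IGB repo; the `materialize` helper and the tiny demo shapes are hypothetical stand-ins for the real ~100M x 1024 float32 file:

```python
import numpy as np
import tempfile
import os.path as osp

def materialize(mm, chunk_rows=1_000_000):
    """Copy a read-only memmap into an in-RAM ndarray chunk by chunk,
    so page faults happen once up front rather than randomly mid-training."""
    out = np.empty(mm.shape, dtype=mm.dtype)
    for start in range(0, mm.shape[0], chunk_rows):
        out[start:start + chunk_rows] = mm[start:start + chunk_rows]
    return out

# Tiny demo with stand-in sizes.
path = osp.join(tempfile.mkdtemp(), "node_feat.npy")
mm = np.memmap(path, dtype="float32", mode="w+", shape=(8, 4))
mm[:] = 1.0
mm.flush()
feat = materialize(np.memmap(path, dtype="float32", mode="r", shape=(8, 4)),
                   chunk_rows=3)
```

After the copy, `feat` is a plain ndarray, so downstream code (e.g. `torch.from_numpy`) never touches the disk again.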
@akhatua2
I have been running IGB-large, and since I have sufficient DRAM (~750 GB), I modified the dataloader to use np.load instead of np.memmap. But these pickling-related errors are thrown:

With allow_pickle=True:
```
Traceback (most recent call last):
  File "/storage/utk/miniconda3/envs/dgl/lib/python3.11/site-packages/numpy/lib/npyio.py", line 465, in load
    return pickle.load(fid, **pickle_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: invalid load key, '\xd8'.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/disk/IGB-Datasets/igb/train_single_gpu.py", line 197, in <module>
    dataset = IGB260MDGLDataset(args)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/IGB-Datasets/igb/dataloader.py", line 103, in __init__
    super().__init__(name='IGB260MDGLDataset')
  File "/storage/utk/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl-1.1.1-py3.11-linux-x86_64.egg/dgl/data/dgl_dataset.py", line 112, in __init__
    self._load()
  File "/storage/utk/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl-1.1.1-py3.11-linux-x86_64.egg/dgl/data/dgl_dataset.py", line 203, in _load
    self.process()
  File "/disk/IGB-Datasets/igb/dataloader.py", line 109, in process
    node_features = torch.from_numpy(dataset.paper_feat)
                    ^^^^^^^^^^^^^^^^^^
  File "/disk/IGB-Datasets/igb/dataloader.py", line 39, in paper_feat
    emb = np.load(path, allow_pickle=True)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/storage/utk/miniconda3/envs/dgl/lib/python3.11/site-packages/numpy/lib/npyio.py", line 467, in load
    raise pickle.UnpicklingError(
_pickle.UnpicklingError: Failed to interpret file '/disk/igb/igb_large/full/processed/paper/node_feat.npy' as a pickle
```
With allow_pickle=False:
```
Traceback (most recent call last):
  File "/disk/IGB-Datasets/igb/train_single_gpu.py", line 197, in <module>
    dataset = IGB260MDGLDataset(args)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/IGB-Datasets/igb/dataloader.py", line 103, in __init__
    super().__init__(name='IGB260MDGLDataset')
  File "/storage/utk/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl-1.1.1-py3.11-linux-x86_64.egg/dgl/data/dgl_dataset.py", line 112, in __init__
    self._load()
  File "/storage/utk/miniconda3/envs/dgl/lib/python3.11/site-packages/dgl-1.1.1-py3.11-linux-x86_64.egg/dgl/data/dgl_dataset.py", line 203, in _load
    self.process()
  File "/disk/IGB-Datasets/igb/dataloader.py", line 109, in process
    node_features = torch.from_numpy(dataset.paper_feat)
                    ^^^^^^^^^^^^^^^^^^
  File "/disk/IGB-Datasets/igb/dataloader.py", line 39, in paper_feat
    emb = np.load(path)
          ^^^^^^^^^^^^^
  File "/storage/utk/miniconda3/envs/dgl/lib/python3.11/site-packages/numpy/lib/npyio.py", line 462, in load
    raise ValueError("Cannot load file containing pickled data "
ValueError: Cannot load file containing pickled data when allow_pickle=False
```
I see you mentioned the .npy files are stored in a special way; can you suggest how to mitigate the above issue?
Hey @UtkrishtP, these files are raw memmap buffers rather than standard .npy files with a header, which is why np.load cannot parse them directly. The way to do it would be to read the files with np.memmap as shown in the dataloader, then np.save them to your disk. From there you can np.load them into memory. Let me know if that makes sense.
We don't ship the non-memmap files due to the extremely long read/write times.
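The round trip described above can be sketched end to end. This is a minimal, hedged sketch: the paths, the 8x4 shape, and the simulated input file are stand-ins for the real ~100M x 1024 float32 feature file, so adapt them to your setup (and note that `np.save` on the full file needs roughly 400 GB of extra disk):

```python
import numpy as np
import tempfile
import os.path as osp

tmp = tempfile.mkdtemp()
raw_path = osp.join(tmp, "node_feat.npy")  # raw, headerless memmap file (as shipped)
num_nodes, dim = 8, 4                      # stand-ins for the real 100M x 1024

# Simulate the shipped file: a headerless buffer written via np.memmap.
mm = np.memmap(raw_path, dtype="float32", mode="w+", shape=(num_nodes, dim))
mm[:] = np.arange(num_nodes * dim, dtype="float32").reshape(num_nodes, dim)
mm.flush()

# Step 1: open with np.memmap (np.load fails here -- no .npy header in the file).
feat = np.memmap(raw_path, dtype="float32", mode="r", shape=(num_nodes, dim))

# Step 2: np.save writes a standard .npy (header + data) that np.load understands.
std_path = osp.join(tmp, "node_feat_standard.npy")
np.save(std_path, feat)

# Step 3: plain np.load now works, pulling the whole array into RAM.
emb = np.load(std_path)
```

Once the standard .npy copy exists, the dataloader's `np.load` path works without any pickle flags.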