Adding my own dataset

Question

baseballtrout opened this issue 7 months ago

Hello, first off, this tool works really well, and I am inspired by you all. My name is Bradford Patton, and I go to school at Meharry Medical College. I am having one issue with adding my own knowledge graph: processing my triples fails. What separator should the triples in the KG files use? I get the error below. I figured I would message the masterminds behind this tool.

File "C:\Users\bpatton23\ULTRA\script\run.py", line 243, in <module>
dataset = util.build_dataset(cfg)
File "C:\Users\bpatton23\ULTRA\ultra\util.py", line 149, in build_dataset
dataset = ds_cls(**data_config)
File "C:\Users\bpatton23\ULTRA\ultra\datasets.py", line 246, in __init__
super().__init__(root, transform, pre_transform)
File "C:\Users\bpatton23\AppData\Local\anaconda3\envs\UltraGPU\lib\site-packages\torch_geometric\data\in_memory_dataset.py", line 76, in __init__
super().__init__(root, transform, pre_transform, pre_filter, log)
File "C:\Users\bpatton23\AppData\Local\anaconda3\envs\UltraGPU\lib\site-packages\torch_geometric\data\dataset.py", line 102, in __init__
self._process()
File "C:\Users\bpatton23\AppData\Local\anaconda3\envs\UltraGPU\lib\site-packages\torch_geometric\data\dataset.py", line 235, in _process
self.process()
File "C:\Users\bpatton23\ULTRA\ultra\datasets.py", line 291, in process
train_results = self.load_file(train_files[0], inv_entity_vocab={}, inv_rel_vocab={})
File "C:\Users\bpatton23\ULTRA\ultra\datasets.py", line 264, in load_file
u, r, v = l.split() if self.delimiter is None else l.strip().split(self.delimiter)
ValueError: not enough values to unpack (expected 3, got 1)

Michael Galkin · Answer 1 · Thu Mar 21 2024 07:12:35 GMT+0800 (China Standard Time)

Hi, the default separator is a tab character ("\t"), so the expected input format for triples is TSV.
You can adjust this for your case by setting delimiter = <your symbol> in your custom dataset class; for example, delimiter = "," for comma-separated subject,predicate,object lines.
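To find out which lines would trigger the "expected 3, got 1" error before running ULTRA, a small standalone check can help. This is only a sketch: find_malformed_triples is a hypothetical helper, not part of ULTRA, and it simply reuses the split rule visible in the traceback above.

```python
from typing import Iterable, List, Optional, Tuple


def find_malformed_triples(
    lines: Iterable[str], delimiter: Optional[str] = "\t"
) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line that does not split into 3 fields.

    Hypothetical helper for illustration; it is not part of ULTRA.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        # Same split rule as the load_file line shown in the traceback.
        fields = line.split() if delimiter is None else line.strip().split(delimiter)
        if len(fields) != 3:
            bad.append((number, line.rstrip("\n")))
    return bad


sample = [
    "aspirin\ttreats\theadache\n",  # valid tab-separated triple
    "aspirin,treats,headache\n",    # commas only: a single field when split on tabs
    "\n",                           # blank line: also a single (empty) field
]
print(find_malformed_triples(sample))
```

To check a real file, pass an open file handle instead of the sample list, for example find_malformed_triples(open("train.txt", encoding="utf-8")), where train.txt stands in for your own file name.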

Bradford Patton · Answer 2 · Thu Mar 21 2024 07:27:51 GMT+0800 (China Standard Time)

Thank you for the information. I will try splitting my data into train, valid and test files in TSV format.

Bradford Patton · Answer 3 · Fri Mar 22 2024 02:26:15 GMT+0800 (China Standard Time)

Hello again. It's still not working and gives me the same error as before, even though my data is now in TSV files. Is it possible that once a file has been downloaded, the tool keeps loading that same file and won't download a new one from a new link you put there?

Michael Galkin · Answer 4 · Fri Mar 22 2024 02:29:13 GMT+0800 (China Standard Time)

Yes, you have to clean up the dataset cache folder so that the new files are downloaded.

Bradford Patton · Answer 5 · Fri Mar 22 2024 02:42:41 GMT+0800 (China Standard Time)

How? And where would this dataset cache folder be found?

Michael Galkin · Answer 6 · Fri Mar 22 2024 02:49:38 GMT+0800 (China Standard Time)

The default path in the config files (e.g., for transductive inference) is ~/git/ULTRA/kg-datasets (unless you put your own path in the config). There, delete the folder named after your custom dataset; that should be sufficient.
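If you prefer to do this from Python rather than a file manager, a minimal sketch could look like the following. clear_dataset_cache is a hypothetical helper, not part of ULTRA, and the folder name you pass is an assumption: use whatever folder actually appears under your dataset root.

```python
import shutil
from pathlib import Path


def clear_dataset_cache(root: str, dataset_name: str) -> bool:
    """Delete <root>/<dataset_name> if it exists; return True if it was removed.

    Hypothetical helper for illustration; it is not part of ULTRA.
    """
    if not dataset_name:
        # Guard: an empty name would resolve to the root folder itself.
        raise ValueError("dataset_name must not be empty")
    target = Path(root).expanduser() / dataset_name
    if not target.is_dir():
        return False
    shutil.rmtree(target)  # removes the folder and everything inside it
    return True


# Example call (the dataset folder name here is a placeholder):
# clear_dataset_cache("~/git/ULTRA/kg-datasets", "MyCustomKG")
```

On the next run the dataset folder is created again, so the files are downloaded and processed from scratch.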

Bradford Patton · Answer 7 · Fri Mar 22 2024 03:20:37 GMT+0800 (China Standard Time)

Thank you again, it worked. I will try my datasets again and see if they work.