Issue loading graph from KG-Hub

justaddcoffee opened this issue a year ago

The data seems to download, but I'm getting an error, apparently when the graph is being loaded. Possibly either the nodes or the edges file is not what GRAPE expects?

To reproduce:

from grape.datasets.kghub import KGIDG
g = KGIDG(version='20220722')

Output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/anaconda3/lib/python3.9/site-packages/ensmallen/datasets/graph_retrieval.py:419, in RetrievedGraph.__call__(self)
    413 try:
    414     (
    415         node_types_number,
    416         nodes_number,
    417         edge_types_number,
    418         edges_number
--> 419     ) = edge_list_utils.build_optimal_lists_files(
    420         # NOTE: the following parameters are supported by the parser, but
    421         # so far we have not encountered a single use case where we actually used them.
    422         # original_node_type_path,
    423         # original_node_type_list_separator,
    424         # original_node_types_column_number,
    425         # original_node_types_column,
    426         # original_numeric_node_type_ids,
    427         # original_minimum_node_type_id,
    428         # original_node_type_list_header,
    429         # original_node_type_list_support_balanced_quotes,
    430         # original_node_type_list_rows_to_skip,
    431         # original_node_type_list_max_rows_number,
    432         # original_node_type_list_comment_symbol,
    433         # original_load_node_type_list_in_parallel,
    434         # original_node_type_list_is_correct,
    435         # node_types_number,
    436         target_node_type_list_path=target_node_type_list_path,
    437         target_node_type_list_separator='\t',
    438         target_node_type_list_node_types_column_number=0,
    439         original_node_path=node_path,
    440         original_node_list_header=graph_arguments.get(
    441             "node_list_header"
    442         ),
    443         original_node_list_support_balanced_quotes=graph_arguments.get(
    444             "node_list_support_balanced_quotes"
    445         ),
    446         node_list_rows_to_skip=graph_arguments.get(
    447             "node_list_rows_to_skip"
    448         ),
    449         node_list_is_correct=graph_arguments.get(
    450             "node_list_is_correct"
    451         ),
    452         node_list_max_rows_number=graph_arguments.get(
    453             "node_list_max_rows_number"
    454         ),
    455         node_list_comment_symbol=graph_arguments.get(
    456             "node_list_comment_symbol"
    457         ),
    458         default_node_type=graph_arguments.get(
    459             "default_node_type"
    460         ),
    461         original_nodes_column_number=graph_arguments.get(
    462             "nodes_column_number"
    463         ),
    464         original_nodes_column=graph_arguments.get(
    465             "nodes_column"
    466         ),
    467         original_node_types_separator=graph_arguments.get(
    468             "node_types_separator"
    469         ),
    470         original_node_list_separator=graph_arguments.get(
    471             "node_list_separator"
    472         ),
    473         original_node_list_node_types_column_number=graph_arguments.get(
    474             "node_list_node_types_column_number"
    475         ),
    476         original_node_list_node_types_column=graph_arguments.get(
    477             "node_list_node_types_column"
    478         ),
    479         nodes_number=graph_arguments.get("nodes_number"),
    480         # original_minimum_node_id,
    481         # original_numeric_node_ids,
    482         # original_node_list_numeric_node_type_ids,
    483         original_skip_node_types_if_unavailable=True,
    484         # It make sense to load the node list in parallel only when
    485         # you have to preprocess the node types, since otherwise the nodes number
    486         # would be unknown.
    487         original_load_node_list_in_parallel=target_node_type_list_path is not None,
    488         maximum_node_id=graph_arguments.get(
    489             "maximum_node_id"
    490         ),
    491         target_node_path=target_node_path,
    492         target_node_list_separator='\t',
    493         target_nodes_column=graph_arguments.get(
    494             "nodes_column"
    495         ),
    496         target_nodes_column_number=0,
    497         target_node_list_node_types_column_number=1,
    498         target_node_types_separator="|",
    499         # original_edge_type_path,
    500         # original_edge_type_list_separator,
    501         # original_edge_types_column_number,
    502         # original_edge_types_column,
    503         # original_numeric_edge_type_ids,
    504         # original_minimum_edge_type_id,
    505         # original_edge_type_list_header,
    506         # edge_type_list_rows_to_skip,
    507         # edge_type_list_max_rows_number,
    508         # edge_type_list_comment_symbol,
    509         # load_edge_type_list_in_parallel=True,
    510         # edge_type_list_is_correct,
    511         # edge_types_number,
    512         target_edge_type_list_path=target_edge_type_list_path,
    513         target_edge_type_list_separator='\t',
    514         target_edge_type_list_edge_types_column_number=0,
    515         original_edge_path=os.path.join(
    516             self._cache_path, graph_arguments["edge_path"]),
    517         original_edge_list_header=graph_arguments.get(
    518             "edge_list_header"
    519         ),
    520         original_edge_list_support_balanced_quotes=graph_arguments.get(
    521             "edge_list_support_balanced_quotes"
    522         ),
    523         original_edge_list_separator=graph_arguments.get(
    524             "edge_list_separator"
    525         ),
    526         original_sources_column_number=graph_arguments.get(
    527             "sources_column_number"
    528         ),
    529         original_sources_column=graph_arguments.get(
    530             "sources_column"
    531         ),
    532         original_destinations_column_number=graph_arguments.get(
    533             "destinations_column_number"
    534         ),
    535         original_destinations_column=graph_arguments.get(
    536             "destinations_column"
    537         ),
    538         original_edge_list_edge_types_column_number=graph_arguments.get(
    539             "edge_list_edge_types_column_number"
    540         ),
    541         original_edge_list_edge_types_column=graph_arguments.get(
    542             "edge_list_edge_types_column"
    543         ),
    544         default_edge_type=graph_arguments.get(
    545             "default_edge_type"
    546         ),
    547         original_weights_column_number=graph_arguments.get(
    548             "weights_column_number"
    549         ),
    550         original_weights_column=graph_arguments.get(
    551             "weights_column"
    552         ),
    553         default_weight=graph_arguments.get(
    554             "default_weight"
    555         ),
    556         original_edge_list_numeric_node_ids=graph_arguments.get(
    557             "edge_list_numeric_node_ids"
    558         ),
    559         skip_weights_if_unavailable=graph_arguments.get(
    560             "skip_weights_if_unavailable"
    561         ),
    562         skip_edge_types_if_unavailable=graph_arguments.get(
    563             "skip_edge_types_if_unavailable"
    564         ),
    565         edge_list_comment_symbol=graph_arguments.get(
    566             "edge_list_comment_symbol"
    567         ),
    568         edge_list_max_rows_number=graph_arguments.get(
    569             "edge_list_max_rows_number"
    570         ),
    571         edge_list_rows_to_skip=graph_arguments.get(
    572             "edge_list_rows_to_skip"
    573         ),
    574         load_edge_list_in_parallel=True,
    575         remove_chevrons=graph_arguments.get(
    576             "remove_chevrons"
    577         ),
    578         remove_spaces=graph_arguments.get(
    579             "remove_spaces"
    580         ),
    581         edges_number=graph_arguments.get("edges_number"),
    582         target_edge_path=target_edge_path,
    583         target_edge_list_separator='\t',
    584         sort_temporary_directory=self._sort_tmp_dir,
    585         directed=self._directed,
    586         verbose=self._verbose > 0,
    587         name=self._name,
    588     )
    589 except Exception as e:

ValueError: Cannot open the file at graphs/kghub/KGIDG/20220722/KG-IDG/merged-kg_nodes.tsv

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Input In [32], in <cell line: 2>()
      1 from grape.datasets.kghub import KGIDG
----> 2 g = KGIDG(version='20220722')

File ~/anaconda3/lib/python3.9/site-packages/ensmallen/datasets/kghub.py:159, in KGIDG(directed, preprocess, bioregistry, load_nodes, load_node_types, load_edge_types, load_edge_weights, auto_enable_tradeoffs, sort_tmp_dir, verbose, ring_bell, cache, cache_path, cache_sys_var, version, **kwargs)
     95 def KGIDG(
     96     directed=False, preprocess="auto", bioregistry=False, load_nodes=True, load_node_types=True,
     97     load_edge_types=True, load_edge_weights=True, auto_enable_tradeoffs=True,
     98     sort_tmp_dir=None, verbose=2, ring_bell=False, cache=True, cache_path=None,
     99     cache_sys_var="GRAPH_CACHE_DIR", version="current", **kwargs
    100 ) -> Graph:
    101     """Return KG-IDG graph	
    102 
    103     Parameters
   (...)
    157 	
    158     """
--> 159     return RetrievedGraph(
    160         "KGIDG", version, "kghub", directed, preprocess, bioregistry, load_nodes,
    161         load_node_types, load_edge_types, load_edge_weights, auto_enable_tradeoffs, sort_tmp_dir,
    162         verbose, ring_bell, cache, cache_path, cache_sys_var, kwargs
    163     )()

File ~/anaconda3/lib/python3.9/site-packages/ensmallen/datasets/graph_retrieval.py:590, in RetrievedGraph.__call__(self)
    414     (
    415         node_types_number,
    416         nodes_number,
   (...)
    587         name=self._name,
    588     )
    589 except Exception as e:
--> 590     raise RuntimeError(
    591         f"Something went wrong while preprocessing the graph {self._name}, "
    592         f"version {self._version}, "
    593         f"retrieved from the {self._repository} repository. "
    594         "This is NOT the loading step, but a preprocessing step "
    595         "that loads remote data from third parties. "
    596         "As such there may have been some changes in the remote data "
    597         "that may have made them incompatible with the current "
    598         "expected parametrization. "
    599         "Do open up an issue in the Ensmallen's GitHub repository reporting also the complete"
    600         "exception of this error to help us keep the automatic graph retrieval "
    601         "in good shape. Thank you!"
    602     ) from e
    603 # Store the obtained metadata
    604 self.store_preprocessed_metadata(
    605     node_types_number,
    606     nodes_number,
    607     edge_types_number,
    608     edges_number
    609 )

RuntimeError: Something went wrong while preprocessing the graph KGIDG, version 20220722, retrieved from the kghub repository. This is NOT the loading step, but a preprocessing step that loads remote data from third parties. As such there may have been some changes in the remote data that may have made them incompatible with the current expected parametrization. Do open up an issue in the Ensmallen's GitHub repository reporting also the completeexception of this error to help us keep the automatic graph retrieval in good shape. Thank you!
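
The RuntimeError above is raised with `from e`, so it wraps the original ValueError about the missing nodes file. As a minimal sketch of how to get at that root cause programmatically, assuming only standard Python exception chaining: `root_cause` and `fake_loader` are hypothetical names used for illustration and are not part of GRAPE or Ensmallen.

```python
def root_cause(exc):
    """Follow the __cause__ chain set by `raise ... from ...` back to the first exception."""
    while exc.__cause__ is not None:
        exc = exc.__cause__
    return exc


def fake_loader():
    # Hypothetical stand-in for the failing call; it mimics the exception
    # chain shown in the traceback without needing grape or network access.
    try:
        raise ValueError(
            "Cannot open the file at "
            "graphs/kghub/KGIDG/20220722/KG-IDG/merged-kg_nodes.tsv"
        )
    except Exception as e:
        raise RuntimeError("Something went wrong while preprocessing the graph") from e


try:
    fake_loader()
except RuntimeError as err:
    cause = root_cause(err)
    # Report the underlying error rather than the generic wrapper message.
    print(f"{type(cause).__name__}: {cause}")
```

With the real call, the same `root_cause` helper could be applied to the exception caught around `KGIDG(version='20220722')`.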

Luca Cappelletti · Answer 1 · Fri Jun 02 2023 17:22:31 GMT+0800 (China Standard Time)

The issue with that version is that it does not follow the agreed-upon format. Specifically, the directory contains the following files:

data
- merged
  - merged-kg_edges.tsv
  - merged-kg_nodes.tsv
neg_train_edges.tsv
pos_valid_edges.tsv
merged_graph_stats_20220722.yaml
neg_valid_edges.tsv

Is there any particular reason for this change in data format? Will future KG-Hub data releases also use the new format? Please do let me know so that I can update the relevant metadata.
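
To check where the nodes and edges files actually ended up after extraction, a small search over the cache directory can help. This is only a sketch: `locate_kg_files` is a hypothetical helper, and the demo assumes the listing above means the two files sit under `data/merged/`.

```python
import tempfile
from pathlib import Path


def locate_kg_files(root, names=("merged-kg_nodes.tsv", "merged-kg_edges.tsv")):
    """Map each expected file name to the relative paths under `root` where it was found."""
    root = Path(root)
    return {
        name: sorted(p.relative_to(root).as_posix() for p in root.rglob(name))
        for name in names
    }


# Demo on a throwaway directory that recreates the assumed layout,
# so the sketch runs without downloading anything.
with tempfile.TemporaryDirectory() as tmp:
    merged = Path(tmp) / "data" / "merged"
    merged.mkdir(parents=True)
    for name in ("merged-kg_nodes.tsv", "merged-kg_edges.tsv"):
        (merged / name).touch()
    found = locate_kg_files(tmp)

print(found)
```

On a real cache, the function would be pointed at the version directory from the error message, for example `locate_kg_files("graphs/kghub/KGIDG/20220722")`.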

Justin Reese · Answer 2 · Fri Jun 02 2023 17:27:55 GMT+0800 (China Standard Time)

@caufieldjh any thoughts on this issue? It looks like the file format in KG-IDG changed.

Harry Caufield · Answer 3 · Fri Jun 02 2023 22:35:43 GMT+0800 (China Standard Time)

For several KG-IDG builds, I was storing pre-separated train+valid subgraphs on KG-Hub, but have since removed them.
Does the newest KG-IDG build (20230601) work?
If so, I can go back and fix the offending builds accordingly.

Luca Cappelletti · Answer 4 · Tue Jun 06 2023 15:20:53 GMT+0800 (China Standard Time)

The following runs. I will put together the list of versions that currently fail.

from grape.datasets.kghub import KGIDG

KGIDG(version="20230530")

Luca Cappelletti · Answer 5 · Tue Jun 06 2023 15:56:58 GMT+0800 (China Standard Time)

The versions that are currently failing are:

20211029
20220601
20220606
20220701
20220722
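
A list like this can be gathered by trying each version in turn and recording the ones that raise. The sketch below is not necessarily how the list above was produced: `find_failing_versions` and `stub_loader` are hypothetical names, and the stub replaces the real `KGIDG` call so the example runs without grape or network access.

```python
def find_failing_versions(load, versions):
    """Call `load(version=...)` for each version and collect the ones that raise."""
    failing = {}
    for version in versions:
        try:
            load(version=version)
        except Exception as exc:  # broad on purpose: any failure is worth recording
            failing[version] = f"{type(exc).__name__}: {exc}"
    return failing


def stub_loader(version):
    # Hypothetical stand-in for KGIDG: fails for one version reported
    # in this thread and succeeds for every other version.
    if version == "20220722":
        raise RuntimeError("Something went wrong while preprocessing the graph")
    return object()


report = find_failing_versions(stub_loader, ["20220722", "20230530"])
for version, error in report.items():
    print(version, "->", error)

# With grape installed, the real loader from this thread could be passed instead:
#   from grape.datasets.kghub import KGIDG
#   find_failing_versions(KGIDG, ["20211029", "20220601", "20220606", "20220701", "20220722"])
```

Passing the loader as an argument keeps the loop independent of any particular graph, so the same check could be reused for other KG-Hub datasets.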

Harry Caufield · Answer 6 · Tue Jun 06 2023 21:58:01 GMT+0800 (China Standard Time)

Excellent, thanks. I'll get them fixed.