Issue loading graph from KG-Hub
justaddcoffee opened this issue
The download seems to succeed, but I get an error that appears to occur while the graph is being loaded. Possibly the nodes or edges file is not what GRAPE expects?
To reproduce:
from grape.datasets.kghub import KGIDG
g = KGIDG(version='20220722')
Output:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/anaconda3/lib/python3.9/site-packages/ensmallen/datasets/graph_retrieval.py:419, in RetrievedGraph.__call__(self)
413 try:
414 (
415 node_types_number,
416 nodes_number,
417 edge_types_number,
418 edges_number
--> 419 ) = edge_list_utils.build_optimal_lists_files(
420 # NOTE: the following parameters are supported by the parser, but
421 # so far we have not encountered a single use case where we actually used them.
422 # original_node_type_path,
423 # original_node_type_list_separator,
424 # original_node_types_column_number,
425 # original_node_types_column,
426 # original_numeric_node_type_ids,
427 # original_minimum_node_type_id,
428 # original_node_type_list_header,
429 # original_node_type_list_support_balanced_quotes,
430 # original_node_type_list_rows_to_skip,
431 # original_node_type_list_max_rows_number,
432 # original_node_type_list_comment_symbol,
433 # original_load_node_type_list_in_parallel,
434 # original_node_type_list_is_correct,
435 # node_types_number,
436 target_node_type_list_path=target_node_type_list_path,
437 target_node_type_list_separator='\t',
438 target_node_type_list_node_types_column_number=0,
439 original_node_path=node_path,
440 original_node_list_header=graph_arguments.get(
441 "node_list_header"
442 ),
443 original_node_list_support_balanced_quotes=graph_arguments.get(
444 "node_list_support_balanced_quotes"
445 ),
446 node_list_rows_to_skip=graph_arguments.get(
447 "node_list_rows_to_skip"
448 ),
449 node_list_is_correct=graph_arguments.get(
450 "node_list_is_correct"
451 ),
452 node_list_max_rows_number=graph_arguments.get(
453 "node_list_max_rows_number"
454 ),
455 node_list_comment_symbol=graph_arguments.get(
456 "node_list_comment_symbol"
457 ),
458 default_node_type=graph_arguments.get(
459 "default_node_type"
460 ),
461 original_nodes_column_number=graph_arguments.get(
462 "nodes_column_number"
463 ),
464 original_nodes_column=graph_arguments.get(
465 "nodes_column"
466 ),
467 original_node_types_separator=graph_arguments.get(
468 "node_types_separator"
469 ),
470 original_node_list_separator=graph_arguments.get(
471 "node_list_separator"
472 ),
473 original_node_list_node_types_column_number=graph_arguments.get(
474 "node_list_node_types_column_number"
475 ),
476 original_node_list_node_types_column=graph_arguments.get(
477 "node_list_node_types_column"
478 ),
479 nodes_number=graph_arguments.get("nodes_number"),
480 # original_minimum_node_id,
481 # original_numeric_node_ids,
482 # original_node_list_numeric_node_type_ids,
483 original_skip_node_types_if_unavailable=True,
484 # It make sense to load the node list in parallel only when
485 # you have to preprocess the node types, since otherwise the nodes number
486 # would be unknown.
487 original_load_node_list_in_parallel=target_node_type_list_path is not None,
488 maximum_node_id=graph_arguments.get(
489 "maximum_node_id"
490 ),
491 target_node_path=target_node_path,
492 target_node_list_separator='\t',
493 target_nodes_column=graph_arguments.get(
494 "nodes_column"
495 ),
496 target_nodes_column_number=0,
497 target_node_list_node_types_column_number=1,
498 target_node_types_separator="|",
499 # original_edge_type_path,
500 # original_edge_type_list_separator,
501 # original_edge_types_column_number,
502 # original_edge_types_column,
503 # original_numeric_edge_type_ids,
504 # original_minimum_edge_type_id,
505 # original_edge_type_list_header,
506 # edge_type_list_rows_to_skip,
507 # edge_type_list_max_rows_number,
508 # edge_type_list_comment_symbol,
509 # load_edge_type_list_in_parallel=True,
510 # edge_type_list_is_correct,
511 # edge_types_number,
512 target_edge_type_list_path=target_edge_type_list_path,
513 target_edge_type_list_separator='\t',
514 target_edge_type_list_edge_types_column_number=0,
515 original_edge_path=os.path.join(
516 self._cache_path, graph_arguments["edge_path"]),
517 original_edge_list_header=graph_arguments.get(
518 "edge_list_header"
519 ),
520 original_edge_list_support_balanced_quotes=graph_arguments.get(
521 "edge_list_support_balanced_quotes"
522 ),
523 original_edge_list_separator=graph_arguments.get(
524 "edge_list_separator"
525 ),
526 original_sources_column_number=graph_arguments.get(
527 "sources_column_number"
528 ),
529 original_sources_column=graph_arguments.get(
530 "sources_column"
531 ),
532 original_destinations_column_number=graph_arguments.get(
533 "destinations_column_number"
534 ),
535 original_destinations_column=graph_arguments.get(
536 "destinations_column"
537 ),
538 original_edge_list_edge_types_column_number=graph_arguments.get(
539 "edge_list_edge_types_column_number"
540 ),
541 original_edge_list_edge_types_column=graph_arguments.get(
542 "edge_list_edge_types_column"
543 ),
544 default_edge_type=graph_arguments.get(
545 "default_edge_type"
546 ),
547 original_weights_column_number=graph_arguments.get(
548 "weights_column_number"
549 ),
550 original_weights_column=graph_arguments.get(
551 "weights_column"
552 ),
553 default_weight=graph_arguments.get(
554 "default_weight"
555 ),
556 original_edge_list_numeric_node_ids=graph_arguments.get(
557 "edge_list_numeric_node_ids"
558 ),
559 skip_weights_if_unavailable=graph_arguments.get(
560 "skip_weights_if_unavailable"
561 ),
562 skip_edge_types_if_unavailable=graph_arguments.get(
563 "skip_edge_types_if_unavailable"
564 ),
565 edge_list_comment_symbol=graph_arguments.get(
566 "edge_list_comment_symbol"
567 ),
568 edge_list_max_rows_number=graph_arguments.get(
569 "edge_list_max_rows_number"
570 ),
571 edge_list_rows_to_skip=graph_arguments.get(
572 "edge_list_rows_to_skip"
573 ),
574 load_edge_list_in_parallel=True,
575 remove_chevrons=graph_arguments.get(
576 "remove_chevrons"
577 ),
578 remove_spaces=graph_arguments.get(
579 "remove_spaces"
580 ),
581 edges_number=graph_arguments.get("edges_number"),
582 target_edge_path=target_edge_path,
583 target_edge_list_separator='\t',
584 sort_temporary_directory=self._sort_tmp_dir,
585 directed=self._directed,
586 verbose=self._verbose > 0,
587 name=self._name,
588 )
589 except Exception as e:
ValueError: Cannot open the file at graphs/kghub/KGIDG/20220722/KG-IDG/merged-kg_nodes.tsv
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
Input In [32], in <cell line: 2>()
1 from grape.datasets.kghub import KGIDG
----> 2 g = KGIDG(version='20220722')
File ~/anaconda3/lib/python3.9/site-packages/ensmallen/datasets/kghub.py:159, in KGIDG(directed, preprocess, bioregistry, load_nodes, load_node_types, load_edge_types, load_edge_weights, auto_enable_tradeoffs, sort_tmp_dir, verbose, ring_bell, cache, cache_path, cache_sys_var, version, **kwargs)
95 def KGIDG(
96 directed=False, preprocess="auto", bioregistry=False, load_nodes=True, load_node_types=True,
97 load_edge_types=True, load_edge_weights=True, auto_enable_tradeoffs=True,
98 sort_tmp_dir=None, verbose=2, ring_bell=False, cache=True, cache_path=None,
99 cache_sys_var="GRAPH_CACHE_DIR", version="current", **kwargs
100 ) -> Graph:
101 """Return KG-IDG graph
102
103 Parameters
(...)
157
158 """
--> 159 return RetrievedGraph(
160 "KGIDG", version, "kghub", directed, preprocess, bioregistry, load_nodes,
161 load_node_types, load_edge_types, load_edge_weights, auto_enable_tradeoffs, sort_tmp_dir,
162 verbose, ring_bell, cache, cache_path, cache_sys_var, kwargs
163 )()
File ~/anaconda3/lib/python3.9/site-packages/ensmallen/datasets/graph_retrieval.py:590, in RetrievedGraph.__call__(self)
414 (
415 node_types_number,
416 nodes_number,
(...)
587 name=self._name,
588 )
589 except Exception as e:
--> 590 raise RuntimeError(
591 f"Something went wrong while preprocessing the graph {self._name}, "
592 f"version {self._version}, "
593 f"retrieved from the {self._repository} repository. "
594 "This is NOT the loading step, but a preprocessing step "
595 "that loads remote data from third parties. "
596 "As such there may have been some changes in the remote data "
597 "that may have made them incompatible with the current "
598 "expected parametrization. "
599 "Do open up an issue in the Ensmallen's GitHub repository reporting also the complete"
600 "exception of this error to help us keep the automatic graph retrieval "
601 "in good shape. Thank you!"
602 ) from e
603 # Store the obtained metadata
604 self.store_preprocessed_metadata(
605 node_types_number,
606 nodes_number,
607 edge_types_number,
608 edges_number
609 )
RuntimeError: Something went wrong while preprocessing the graph KGIDG, version 20220722, retrieved from the kghub repository. This is NOT the loading step, but a preprocessing step that loads remote data from third parties. As such there may have been some changes in the remote data that may have made them incompatible with the current expected parametrization. Do open up an issue in the Ensmallen's GitHub repository reporting also the completeexception of this error to help us keep the automatic graph retrieval in good shape. Thank you!
The issue with that version is that it does not follow the agreed-upon format. Specifically, the directory contains the following files:
- data
- merged
- merged-kg_edges.tsv
- merged-kg_nodes.tsv
- merged
- neg_train_edges.tsv
- pos_valid_edges.tsv
- merged_graph_stats_20220722.yaml
- neg_valid_edges.tsv
Is there any particular reason for this change in data format? Will future KG-Hub data releases also use the new format? Please let me know so that I can update the relevant metadata.
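To check where the node and edge files from the listing above actually sit inside an extracted release, a minimal sketch along these lines may help. The `locate_files` helper and its parameter names are hypothetical; it is not part of GRAPE or Ensmallen and uses only the Python standard library.

```python
from pathlib import Path
from typing import Dict, List


def locate_files(root: str, expected_names: List[str]) -> Dict[str, List[str]]:
    """Map each expected file name to the relative paths where it was found under root.

    Hypothetical helper for illustration only: it walks the directory tree
    and records every file whose name matches one of the expected names.
    """
    root_path = Path(root)
    found: Dict[str, List[str]] = {name: [] for name in expected_names}
    for path in root_path.rglob("*"):
        if path.is_file() and path.name in found:
            # Store paths relative to the root, with forward slashes.
            found[path.name].append(path.relative_to(root_path).as_posix())
    return {name: sorted(paths) for name, paths in found.items()}
```

An expected name that maps to an empty list is missing entirely, while a name found only in a nested directory suggests the archive layout differs from what the loader expects.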
@caufieldjh any thoughts on this issue? It looks like the file format in KG-IDG changed.
For several KG-IDG builds, I was storing pre-separated train+valid subgraphs on KG-Hub, but have since removed them.
Does the newest KG-IDG build (20230601) work?
If so, I can go back and fix the offending builds accordingly.
The following runs without errors. I will put together the list of versions that currently fail.
from grape.datasets.kghub import KGIDG
KGIDG(version="20230530")
The versions that are currently failing are:
- 20211029
- 20220601
- 20220606
- 20220701
- 20220722
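A list like the one above could be gathered by trying each version in turn and recording the underlying error. The sketch below is one way to do that, not necessarily how the list was actually produced. The helper names `root_cause` and `find_failing_versions` are hypothetical. The loader is passed in as a callable so the function itself does not depend on GRAPE.

```python
from typing import Callable, Dict, Iterable


def root_cause(error: BaseException) -> BaseException:
    """Follow the `raise ... from ...` chain down to the original exception."""
    while error.__cause__ is not None:
        error = error.__cause__
    return error


def find_failing_versions(
    load_graph: Callable[[str], object],
    versions: Iterable[str],
) -> Dict[str, str]:
    """Return {version: root error description} for every version that fails to load."""
    failures: Dict[str, str] = {}
    for version in versions:
        try:
            load_graph(version)
        except Exception as error:
            # The traceback in this thread shows a ValueError wrapped in a
            # RuntimeError, so report the innermost cause rather than the wrapper.
            cause = root_cause(error)
            failures[version] = f"{type(cause).__name__}: {cause}"
    return failures


# Usage with GRAPE (downloads data, so it is left commented out here):
#   from grape.datasets.kghub import KGIDG
#   failures = find_failing_versions(
#       lambda version: KGIDG(version=version),
#       ["20220722", "20230530"],
#   )
```

Reporting the root cause surfaces messages such as "Cannot open the file at ..." instead of the generic preprocessing error, which makes it easier to tell a layout problem from a network failure.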
Excellent, thanks. I'll get them fixed.