Load graph from a pandas DataFrame
mkarmona opened this issue
Big data engineering pipelines built on Apache Spark often produce sets of triples. To avoid tedious I/O serialisation and coalescing to and from CSV files, PySpark provides the toPandas() method.
This method collects the partitioned, distributed dataset into the local memory of the driver node and makes it accessible as a pandas DataFrame.
A graph constructor that works directly from already produced data frames would therefore be really convenient. Something like:
pedges = edges.toPandas()
pnodes = nodes.toPandas()
g = (
    Graph.from_pd(
        directed=True,
        node_path=pnodes,
        nodes_column_number=0,
        node_list_node_types_column_number=1,
        edge_path=pedges,
        sources_column_number=0,
        destinations_column_number=2,
        edge_list_edge_types_column_number=4,
        weights_column_number=11,
    )
    .remove_components(top_k_components=1)
)
Are the columns numeric, or do you expect them to contain strings?
All numeric
0 -4247916474242508806 111669168462 122270047432111156 7558004278719340179 4 0 508 22987 0 68 3 3
0 -4247916474242508806 120259094189 -4247916474242508806 2024546716798971474 2 0 508 841 0 9 2 2
1 3321359613095714626 34359742828 3321359613095714626 1021329052062355964 6 0 161 15561 1 10 1 1
1 3321359613095714626 94489307989 4459994629667120731 2024546716798971474 1 0 161 100 1 0 1 1
Could you also provide an example of your node list?
Sure! The columns are node ID and node class ID:
42949702932 -4247916474242508806
333 -4247916474242508806
120259105872 122270047432111156
34359758249 8082227106116270368
34359751343 -4247916474242508806
103079232639 4459994629667120731
4201 -4247916474242508806
85899365480 3321359613095714626
4772 122270047432111156
Why are there negative values?
The node IDs come from a function that generates unique numbers at scale for a long list of nodes. For the classes, the cardinality is small, so I use a numeric hash function, xxhash64.
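To illustrate why a 64-bit numeric hash can produce negative class IDs, here is a minimal sketch. It does not use xxhash64 itself: as a stand-in it takes 8 bytes of a BLAKE2b digest from the Python standard library, and the helper names `signed_hash64` and `to_unsigned64` are hypothetical. The point is only that any 64-bit hash stored in a signed integer column comes out negative whenever its top bit is set.

```python
import hashlib


def signed_hash64(value: str) -> int:
    """Hash a string to a signed 64-bit integer.

    Stand-in for a 64-bit hash such as xxhash64; BLAKE2b is used here only
    because it ships with the standard library.
    """
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    # Reading the 8 bytes as two's complement makes the top bit a sign bit,
    # so roughly half of all hashes end up negative.
    return int.from_bytes(digest, byteorder="big", signed=True)


def to_unsigned64(value: int) -> int:
    """Reinterpret a signed 64-bit integer as an unsigned one."""
    return value & 0xFFFFFFFFFFFFFFFF


# Hypothetical class names, hashed to signed 64-bit class IDs.
class_ids = {name: signed_hash64(name) for name in ("user", "product", "order")}
for name, class_id in class_ids.items():
    print(name, class_id, to_unsigned64(class_id))
```

Reinterpreting the value as unsigned removes the sign, but the range stays sparse, so it still cannot serve as a dense index.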
I've implemented from_pd, and this is an example of the usage:
import pandas as pd
from ensmallen import Graph  # import path assumed; not shown in the original comment

# Node list: one row per node, with its name and type.
nodes_df = pd.DataFrame(
    [("a", "user"), ("b", "user"), ("c", "product")],
    columns=["name", "type"],
)
# Edge list: source, destination, weight and edge type.
edges_df = pd.DataFrame(
    [("a", "b", 1.0, "knows"), ("b", "c", 2.0, "bought")],
    columns=["subject", "object", "weight", "predicate"],
)
graph = Graph.from_pd(
    edges_df,
    nodes_df,
    node_name_column="name",
    node_type_column="type",
    edge_src_column="subject",
    edge_dst_column="object",
    edge_weight_column="weight",
    edge_type_column="predicate",
    directed=True,
    name="graph",
)
Would this be ok? We are still debugging it, but if it's ok we can publish a new version soon.
@LucaCappelletti94 thanks for your prompt reply and implementation! Can I assume column types are automatically extracted from the pandas DataFrame? If so, then this parametrised interface works, indeed. Happy to check on my side as soon as the version is out.
Hi @mkarmona - What do you mean by column types? If you are referring to the data type of the node IDs, you seem to be using an i64. That cannot be used as the data type for a dense numeric range, because we compile ensmallen to use a u32. A sparse range that includes negative values forces us to cast these node IDs to strings. For numeric node IDs to be used directly, they would need to form a dense, non-negative range, from 0 to the number of nodes.
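For readers who want numeric IDs anyway, the dense-range requirement described above can be met by remapping the sparse i64 values before building the graph. The following is a minimal sketch in plain Python, not ensmallen's internal logic: `densify_ids` and `U32_MAX` are hypothetical names, and the u32 bound is taken from the comment above.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

U32_MAX = 2**32 - 1  # largest index representable in a u32


def densify_ids(raw_ids: Iterable[Hashable]) -> Tuple[List[int], Dict[Hashable, int]]:
    """Map arbitrary (sparse, possibly negative) IDs to 0..n-1.

    IDs are numbered in order of first appearance. Returns the remapped
    sequence and the raw-to-dense mapping so it can be reused for edges.
    """
    mapping: Dict[Hashable, int] = {}
    dense: List[int] = []
    for raw in raw_ids:
        if raw not in mapping:
            mapping[raw] = len(mapping)
        dense.append(mapping[raw])
    if mapping and len(mapping) - 1 > U32_MAX:
        raise ValueError("too many distinct IDs to index with a u32")
    return dense, mapping


# Sparse node IDs in the style of the node list above, with one repeat.
dense_ids, id_map = densify_ids([42949702932, 333, 120259105872, 333])
print(dense_ids)  # [0, 1, 2, 1]

# The alternative described in the thread: treat every ID as a string.
string_ids = [str(raw) for raw in (42949702932, 333, -4247916474242508806)]
print(string_ids)  # ['42949702932', '333', '-4247916474242508806']
```

The same `id_map` would need to be applied to both the source and destination columns of the edge list so that nodes and edges stay consistent.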
Yeah, in the current implementation everything is treated as a string, regardless of the type.
Just curious if this made it to a release yet ;)
On Linux and macOS yes, but not on Windows.
@LucaCappelletti94 @zommiommy thanks a lot for implementing this feature. I can confirm it works for me. This issue is done so please close it as you wish.