load graph from a Pandas Dataframe

Question

mkarmona opened this issue a year ago · comments

Big data engineering processes using Apache Spark produce triple sets. To avoid tedious IO serialisation and coalescing to/from CSV files, PySpark provides the toPandas() method. This method collects the partitioned, distributed dataset into the local memory of the driver node and makes it accessible as a Pandas data frame.
A graph constructor that works directly from these already-produced data frames would therefore be really convenient.

pedges = edges.toPandas()
pnodes = nodes.toPandas() 

g = (Graph.from_pd(directed=True, 
                    node_path=pnodes,
                    nodes_column_number=0,
                    node_list_node_types_column_number=1,
                    edge_path=pedges,
                    sources_column_number=0,
                    destinations_column_number=2,
                    edge_list_edge_types_column_number=4,
                    weights_column_number=11)
       .remove_components(top_k_components=1)
    )

Luca Cappelletti · Answer 1 · Fri Jun 23 2023 21:11:48 GMT+0800 (China Standard Time)

Are the columns numeric, or do you expect them to contain strings?

Miguel Carmona · Answer 2 · Fri Jun 23 2023 21:38:01 GMT+0800 (China Standard Time)

All numeric

0	-4247916474242508806	111669168462	122270047432111156	7558004278719340179	4	0	508	22987	0	68	3	3
0	-4247916474242508806	120259094189	-4247916474242508806	2024546716798971474	2	0	508	841	0	9	2	2
1	3321359613095714626	34359742828	3321359613095714626	1021329052062355964	6	0	161	15561	1	10	1	1
1	3321359613095714626	94489307989	4459994629667120731	2024546716798971474	1	0	161	100	1	0	1	1

Luca Cappelletti · Answer 3 · Fri Jun 23 2023 21:45:56 GMT+0800 (China Standard Time)

Could you also provide an example of your node list?

Miguel Carmona · Answer 4 · Fri Jun 23 2023 21:55:28 GMT+0800 (China Standard Time)

Sure! Node ID and node class ID:

42949702932	-4247916474242508806
333	-4247916474242508806
120259105872	122270047432111156
34359758249	8082227106116270368
34359751343	-4247916474242508806
103079232639	4459994629667120731
4201	-4247916474242508806
85899365480	3321359613095714626
4772	122270047432111156

Luca Cappelletti · Answer 5 · Fri Jun 23 2023 21:56:15 GMT+0800 (China Standard Time)

Why are there negative values?

Miguel Carmona · Answer 6 · Fri Jun 23 2023 21:58:54 GMT+0800 (China Standard Time)

The node IDs come from a function that generates unique numbers at scale for a long list of nodes. For the classes, the cardinality is small, so I use a numeric hash function, xxhash64.

Tommaso Fontana · Answer 7 · Fri Jun 23 2023 23:36:06 GMT+0800 (China Standard Time)

I've implemented from_pd and this is an example of the usage:

nodes_df = pd.DataFrame(
    [("a", "user"), ("b", "user"), ("c", "product")],
    columns=["name", "type"],
)

edges_df = pd.DataFrame(
    [("a", "b", 1.0, "knows"), ("b", "c", 2.0, "bought")],
    columns=["subject", "object", "weight", "predicate"],
)

graph = Graph.from_pd(
    edges_df,
    nodes_df,
    node_name_column="name",
    node_type_column="type",
    edge_src_column="subject",
    edge_dst_column="object",
    edge_weight_column="weight",
    edge_type_column="predicate",
    directed=True,
    name="graph",
)

Would this be ok? We are still debugging it, but if it's ok we can publish a new version soon.

Miguel Carmona · Answer 8 · Mon Jun 26 2023 16:22:44 GMT+0800 (China Standard Time)

@LucaCappelletti94 thanks for your prompt reply and implementation! Can I assume column types are automatically extracted from the Pandas dataframe? If so, then this parametrised interface does work. Happy to check on my side as soon as the version is out.

Luca Cappelletti · Answer 9 · Mon Jun 26 2023 16:50:57 GMT+0800 (China Standard Time)

Hi @mkarmona - What do you mean by column types? If you are referring to the data type of the node IDs, you seem to be using an i64. That cannot be used as the data type for a dense numeric range, as we compile ensmallen to use a u32. A sparse range that includes negative values forces us to cast these node IDs to strings. For numeric node IDs to be used, they would need to form a dense positive range, from 0 to the number of nodes.
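For readers who want to see what a "dense positive range" looks like in practice, here is a minimal pandas-only sketch that remaps sparse i64 node IDs to 0..n-1. The column names (node_id, class_id, src, dst, dense_id) are hypothetical, the sample values are taken from the node list earlier in this thread, and nothing here is confirmed to be how ensmallen handles IDs internally.

```python
import pandas as pd

# Sample values from the thread; column names are assumptions for illustration.
nodes = pd.DataFrame({
    "node_id": [42949702932, 333, 120259105872],
    "class_id": [-4247916474242508806, -4247916474242508806, 122270047432111156],
})
edges = pd.DataFrame({
    "src": [333, 42949702932],
    "dst": [120259105872, 333],
})

# pd.factorize assigns codes 0..n-1 in order of first appearance.
codes, uniques = pd.factorize(nodes["node_id"])
nodes["dense_id"] = codes

# Lookup table from the original sparse ID to the dense ID.
id_map = pd.Series(range(len(uniques)), index=uniques)

# Apply the same mapping to both edge endpoints.
edges["src_dense"] = edges["src"].map(id_map)
edges["dst_dense"] = edges["dst"].map(id_map)
```

The dense IDs now fit comfortably in a u32, and id_map can be kept to translate results back to the original identifiers.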

Tommaso Fontana · Answer 10 · Mon Jun 26 2023 16:55:12 GMT+0800 (China Standard Time)

Yeah, in the current implementation everything is treated as a string, regardless of the type.
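To illustrate what string treatment means for the hashed values above, this small pandas sketch casts the ID columns to strings explicitly. It only shows the resulting representation. The column names are hypothetical, and this is not the library's internal code.

```python
import pandas as pd

# Hashed i64 values taken from the edge list in this thread;
# the column names are assumptions for illustration.
edges = pd.DataFrame({
    "src": [-4247916474242508806, 3321359613095714626],
    "dst": [122270047432111156, 4459994629667120731],
    "weight": [3.0, 1.0],
})

# Cast only the ID columns; the weight column keeps its numeric dtype.
edges_str = edges.astype({"src": str, "dst": str})

# Negative hashes become ordinary string labels such as "-4247916474242508806".
src_labels = edges_str["src"].tolist()
```

Because the IDs end up as opaque labels, negative or sparse values are harmless in this mode.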

Hüseyin Oktay · Answer 11 · Wed Aug 02 2023 05:29:14 GMT+0800 (China Standard Time)

Just curious if this made it to a release yet ;)

Luca Cappelletti · Answer 12 · Wed Aug 02 2023 16:38:31 GMT+0800 (China Standard Time)

On Linux and macOS yes, but not on Windows.

Miguel Carmona · Answer 13 · Tue Oct 03 2023 16:38:55 GMT+0800 (China Standard Time)

@LucaCappelletti94 @zommiommy thanks a lot for implementing this feature. I can confirm it works for me. This issue is resolved, so please feel free to close it.