AnacletoLAB / grape

🍇 GRAPE is a Rust/Python Graph Representation Learning library for Predictions and Evaluations

Load graph from a Pandas DataFrame

mkarmona opened this issue · comments

Big data engineering processes using Apache Spark produce triple sets. To avoid tedious IO serialisation and coalescing to/from CSV files, PySpark provides the toPandas() method. This method collects the partitioned, distributed dataset into the local memory of the driver node and makes it accessible as a Pandas data frame.
A graph constructor that works directly on these already-produced data frames would therefore be really convenient, for example:

pedges = edges.toPandas()
pnodes = nodes.toPandas()

g = (
    Graph.from_pd(
        directed=True,
        node_path=pnodes,
        nodes_column_number=0,
        node_list_node_types_column_number=1,
        edge_path=pedges,
        sources_column_number=0,
        destinations_column_number=2,
        edge_list_edge_types_column_number=4,
        weights_column_number=11,
    )
    .remove_components(top_k_components=1)
)

Are the columns numeric, or do you expect them to contain strings?

All numeric. Here are a few rows from the edge list:

0	-4247916474242508806	111669168462	122270047432111156	7558004278719340179	4	0	508	22987	0	68	3	3
0	-4247916474242508806	120259094189	-4247916474242508806	2024546716798971474	2	0	508	841	0	9	2	2
1	3321359613095714626	34359742828	3321359613095714626	1021329052062355964	6	0	161	15561	1	10	1	1
1	3321359613095714626	94489307989	4459994629667120731	2024546716798971474	1	0	161	100	1	0	1	1

Could you also provide an example of your node list?

Sure! Each row is a node ID followed by a node class ID:

42949702932	-4247916474242508806
333	-4247916474242508806
120259105872	122270047432111156
34359758249	8082227106116270368
34359751343	-4247916474242508806
103079232639	4459994629667120731
4201	-4247916474242508806
85899365480	3321359613095714626
4772	122270047432111156

Why are there negative values?

The node IDs come from a function that generates unique numbers at scale for a long list of nodes. The classes have small cardinality, so I hash them with the numeric hash function xxhash64, which is where the negative values come from.

I've implemented from_pd and this is an example of the usage:

nodes_df = pd.DataFrame(
    [("a", "user"), ("b", "user"), ("c", "product")],
    columns=["name", "type"],
)

edges_df = pd.DataFrame(
    [("a", "b", 1.0, "knows"), ("b", "c", 2.0, "bought")],
    columns=["subject", "object", "weight", "predicate"],
)

graph = Graph.from_pd(
    edges_df,
    nodes_df,
    node_name_column="name",
    node_type_column="type",
    edge_src_column="subject",
    edge_dst_column="object",
    edge_weight_column="weight",
    edge_type_column="predicate",
    directed=True,
    name="graph",
)

Would this be ok? We are still debugging it, but if it's ok we can publish a new version soon.

@LucaCappelletti94 thanks for your prompt reply and implementation! Can I assume column types are automatically extracted from the Pandas dataframe? If so, then this parametrised interface works indeed. Happy to check on my side as soon as the version is out.

Hi @mkarmona - What do you mean by column types? If you are referring to the data type of the node IDs, you seem to be using an i64. That cannot be used as the data type for a dense numeric range, as we compile ensmallen to use a u32. A sparse range that includes negative values forces us to cast these node IDs to strings. For numeric node IDs to be used as such, they would need to form a dense positive range, from 0 to the number of nodes.
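
To illustrate what a "dense positive range" means for IDs like the ones in this thread, here is a minimal sketch that remaps sparse 64-bit IDs to 0..n-1 with pandas.factorize. The column names (src, dst, src_dense, dst_dense) and the sample values are assumptions made for the example. It only shows the remapping idea; it does not imply that from_pd expects or requires this step.

```python
import pandas as pd

# Sparse, possibly negative 64-bit ids (illustrative values only).
edges = pd.DataFrame({
    "src": [111669168462, 120259094189, 34359742828],
    "dst": [120259094189, 34359742828, -4247916474242508806],
})

# Factorize over both endpoint columns together, so each distinct id
# gets exactly one code in 0..n-1, in order of first appearance.
all_ids = pd.concat([edges["src"], edges["dst"]], ignore_index=True)
codes, uniques = pd.factorize(all_ids)

n = len(edges)
edges["src_dense"] = codes[:n]
edges["dst_dense"] = codes[n:]

# Lookup table from dense id (the index) back to the original id.
id_lookup = pd.Series(uniques, name="original_id")
```

Keeping id_lookup around makes it possible to translate results computed on the dense IDs back to the original identifiers.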

Yeah, in the current implementation everything is treated as a string, regardless of the type.
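
For readers who want to see what "treated as a string" amounts to on the pandas side, the following sketch casts numeric ID columns to text with DataFrame.astype. The column names and values are assumptions for the example, and the cast is shown only for illustration; according to the comment above, the library already does this internally.

```python
import pandas as pd

# Node list shaped like the one posted earlier: node id and node class id.
nodes = pd.DataFrame({
    "node_id": [42949702932, 333, 120259105872],
    "class_id": [-4247916474242508806, -4247916474242508806, 122270047432111156],
})

# Cast every column to text; negative and sparse ids are then just labels.
nodes_as_text = nodes.astype(str)
```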

Just curious whether this has made it into a release yet ;)

On Linux and macOS yes, but not on Windows.

@LucaCappelletti94 @zommiommy thanks a lot for implementing this feature. I can confirm it works for me. This issue is done so please close it as you wish.