Graph Comparison

Introduction
Project structure
Implementation
Initialize project in your machine
Table of results
References

Introduction

The purpose of this project is to compare a hand-written graph class and its methods against GraphFrame, the state-of-the-art PySpark class for graph operations. The comparison is based on execution time and aims to show the advantage of parallel computing over non-parallelized data structures when dealing with large datasets. The project also aims to show the relationship between the number of worker nodes and computation time for graph-related problems. The dataset used is the Twitch Social Networks dataset [1] from Stanford University. Almost all the tests are run on the twitch/ENGB dataset, which is composed of 7,126 nodes and 35,324 edges. Additional tests were run on the twitch/DE dataset, which has 9,498 nodes and 153,138 edges.

Project structure

All the tests and experiments were first performed on Colaboratory with the GraphMethodsComparison.ipynb file, before the project was moved to the AWS platform. Starting from that file, the code has been divided into 3 main classes:

StandardGraph.py: This class is a personal implementation of a graph in Python, using an adjacency list for the nodes and an additional hashmap to carry node attributes and values.
SparkGraph.py: This class is the parallelized version of the graph. It has only one attribute, which is a Spark GraphFrame object. Its init method can construct the graph directly from target.csv and edges.csv (check their headers and composition to see whether your dataset fits).
GraphComparison.py: This class is the comparator for the two graphs. To be initialized it needs an instance of StandardGraph, an instance of SparkGraph and the Spark context. This class contains all the methods to run and compare the following graph operations:
- Retrieve nodes that satisfy a particular query.
- Retrieve the node that has the max value of a given parameter.
- Retrieve connected components.
- Retrieve strongly connected components.
- Count the number of triangles.
- Get the indegree of every node of the graph.
- Get the shortest path between two nodes.
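
As an illustration of the structure described for StandardGraph.py, the following is a minimal sketch of a graph backed by an adjacency list and a separate attribute map. The class name SimpleGraph, its method names and the views/partner attributes are hypothetical and are not taken from the project's code.

```python
from collections import defaultdict


class SimpleGraph:
    """Hypothetical stand-in for StandardGraph: adjacency list plus attribute map."""

    def __init__(self):
        self.adjacency = defaultdict(list)  # node id -> list of neighbour ids
        self.attributes = {}                # node id -> dict of attribute values

    def add_node(self, node_id, **attrs):
        self.adjacency[node_id]  # touching the key creates an empty neighbour list
        self.attributes[node_id] = dict(attrs)

    def add_edge(self, src, dst):
        # Directed edge src -> dst; both endpoints are registered as nodes.
        self.adjacency[src].append(dst)
        self.adjacency[dst]
        self.attributes.setdefault(src, {})
        self.attributes.setdefault(dst, {})


g = SimpleGraph()
g.add_node(0, views=120, partner=False)
g.add_node(1, views=4500, partner=True)
g.add_edge(0, 1)
g.add_edge(1, 2)  # node 2 is created implicitly, with no attributes
```

Keeping attributes in their own map means that traversals only touch the adjacency list, while queries on node values are plain dictionary lookups.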

All of these methods are implemented differently for StandardGraph (which does not use parallelization) and for GraphFrame. The execution time of every method is recorded in the Results/.csv file so that the two implementations can be compared and conclusions drawn.
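
A timing comparison of this kind could be recorded along the following lines. This is only a sketch: the helper names timed and append_timing, the CSV column order and the use of time.perf_counter are assumptions, not the project's actual code.

```python
import csv
import time


def timed(func, *args, **kwargs):
    """Run func once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def append_timing(path, method, workers, standard_s, spark_s, dataset):
    # Append one row per measurement; the column layout is an assumption.
    with open(path, "a", newline="") as handle:
        csv.writer(handle).writerow([method, workers, standard_s, spark_s, dataset])


# Example: time a plain Python computation.
result, elapsed = timed(sum, range(1_000_000))
```

When timing Spark code, keep in mind that DataFrame transformations are lazy, so the timed callable should end with an action such as count() or collect(); otherwise only the construction of the query plan is measured.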

Main.py is the entry point: it performs the initializations and runs all the GraphComparison methods.

Implementation

Every method of StandardGraph uses a well-known algorithm.

BFSQuery(): A breadth-first search over the entire graph; every node that satisfies the query is added to the result. Time complexity: O(v).
nodeWithMaxValueOfAttribute(): Same principle as BFSQuery(). Time complexity: O(v).
connectedComponents(): Retrieved using a DFS helper. Time complexity: O(v+e).
stronglyConnectedComponents(): An implementation of Tarjan's algorithm. Time complexity: O(v+e).
countTriangles(): Implemented without the matrix-trace approach. Time complexity: O(v^3).
indegree(): Builds a node: indegree map for every node in the graph. Time complexity: O(v * max(number of edges of a node)).
shortestPath(): Gets the shortest path between two nodes. Time complexity: O(v*e).
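
Two of the methods above, the breadth-first query and the DFS-based connected components, can be sketched as follows on a plain dictionary-based adjacency list. The function names, the predicate argument and the sample views attribute are illustrative assumptions; the project's own implementation may differ.

```python
from collections import deque


def bfs_query(adjacency, attributes, predicate):
    """Visit every node breadth-first; collect those whose attributes satisfy predicate."""
    visited, matches = set(), []
    for start in adjacency:  # restart so that disconnected parts are covered too
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if predicate(attributes.get(node, {})):
                matches.append(node)
            for neighbour in adjacency.get(node, ()):
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append(neighbour)
    return matches


def connected_components(adjacency):
    """Group nodes into components with an iterative DFS (edges treated as undirected)."""
    undirected = {node: set() for node in adjacency}
    for node, neighbours in adjacency.items():
        for neighbour in neighbours:
            undirected[node].add(neighbour)
            undirected.setdefault(neighbour, set()).add(node)

    seen, components = set(), []
    for start in undirected:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in undirected[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


adjacency = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
attributes = {0: {"views": 10}, 1: {"views": 900}, 2: {"views": 5000},
              3: {"views": 700}, 4: {"views": 2}}

popular = bfs_query(adjacency, attributes, lambda a: a.get("views", 0) > 500)
components = connected_components(adjacency)
```

Both functions use an explicit queue or stack rather than recursion, which avoids hitting Python's recursion limit on graphs with thousands of nodes.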

Here v is the number of vertices and e is the number of edges.
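
To give a feel for what a cubic bound means at this scale, here is a brute-force triangle count that tests every unordered triple of nodes. It is a sketch of one possible O(v^3) approach, not necessarily the one used in countTriangles(); the function name is hypothetical. For the 7,126 nodes of ENGB the number of triples is roughly 6 * 10^10, which is consistent with a pure-Python run exceeding 20 minutes.

```python
from itertools import combinations


def count_triangles_naive(adjacency):
    """Count triangles by testing every unordered triple of nodes: O(v^3)."""
    # Build undirected neighbour sets so that edge direction is ignored.
    neighbours = {node: set() for node in adjacency}
    for node, targets in adjacency.items():
        for target in targets:
            if target != node:  # skip self-loops
                neighbours[node].add(target)
                neighbours.setdefault(target, set()).add(node)
    return sum(
        1
        for a, b, c in combinations(list(neighbours), 3)
        if b in neighbours[a] and c in neighbours[b] and a in neighbours[c]
    )


# Edges 0-1, 0-2, 1-2 form a single triangle; edge 2-3 adds none.
triangles = count_triangles_naive({0: [1, 2], 1: [2], 2: [3], 3: []})

# Number of unordered triples for the ENGB node count.
triple_checks = 7126 * 7125 * 7124 // 6
```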

The methods of SparkGraph are the ones provided by the GraphFrames Python library. More information about the methods used in this project, and many others, is available in the GraphFrames documentation: https://graphframes.github.io/graphframes/docs/_site/user-guide.html
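
For orientation, the GraphFrames calls that correspond to the compared operations look roughly like the sketch below. It assumes a running SparkSession with the graphframes package available, a vertices CSV with an "id" column and an edges CSV with "src" and "dst" columns (the column names GraphFrames requires). The function name, the "views" column, the node ids 1 and 2 and the checkpoint path are placeholders, not values taken from the project.

```python
def build_and_query(spark, nodes_csv, edges_csv):
    """Sketch of GraphFrames calls that mirror the compared operations."""
    # Imported lazily so this file loads even where graphframes is not installed.
    from graphframes import GraphFrame

    vertices = spark.read.csv(nodes_csv, header=True, inferSchema=True)  # needs "id"
    edges = spark.read.csv(edges_csv, header=True, inferSchema=True)     # needs "src", "dst"
    graph = GraphFrame(vertices, edges)

    # connectedComponents() requires a checkpoint directory (placeholder path).
    spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

    return {
        "query": graph.vertices.filter("views > 500"),   # "views" is an assumed column
        "components": graph.connectedComponents(),
        "strong_components": graph.stronglyConnectedComponents(maxIter=10),
        "triangles": graph.triangleCount(),
        "in_degrees": graph.inDegrees,
        "shortest_path": graph.bfs("id = 1", "id = 2"),  # placeholder node ids
    }
```

Each entry is a Spark DataFrame, so an action such as show() or count() is still needed to materialize the rows.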

The AWS setup was done with Terraform, building on a project by a friend and colleague that you can find at [2]. With some tweaks to the parameters, and by following the guide, it is possible to deploy with Terraform safely.

You can find more about the implementation, results, and conclusions in the GraphComparisonReport.pdf file.

Initialize project in your machine

Download Terraform from its website and install it on your machine.
Download the Terraform project from here and unzip it. (Go to the repository and press Code -> Download ZIP.)
Enter the "spark-terraform-master/" folder you have just extracted.
Create a file named "terraform.tfvars" and paste this:

access_key="<YOUR AWS ACCESS KEY>"
secret_key="<YOUR AWS SECRET KEY>"
token="<YOUR AWS TOKEN>"

Replace the placeholders inside the quotes with your AWS access key, secret key and token.

If you are using AWS Educate, you can retrieve these values from the Vocareum page shown after logging in, by clicking the "Account Details" button under the "AWS CLI" entry.
If you are using a regular AWS account, follow the guide in the AWS docs, in the paragraph called "Generate access keys for programmatic access".

Note: if you do not set the other variables (you can find them in variables.tf), Terraform will create a cluster in the "us-east-1" region with 1 namenode, 6 datanodes and the m5.xlarge instance type.

Download THIS repository and unzip it. (Scroll up the page and press Code -> Download ZIP.)
Take all the files inside the "GraphComparison-main" folder you have just downloaded and put them into the "app" folder inside the spark-terraform-master folder you downloaded in step 1 (e.g. main.py should end up at spark-terraform-master/app/main.py, and so on for all the other files).
Open a terminal inside the spark-terraform-master/ folder and generate a new SSH key with an empty passphrase:

ssh-keygen -f localkey

Log in to AWS and create a key pair named amzkey in PEM file format, following the guide in the AWS docs. Download the key and put it in the spark-terraform-master/ folder. In the same terminal used to generate the SSH key (located in the spark-terraform-master/ folder), execute this command to fix the permissions of the key:

chmod 500 amzkey.pem

Go to EC2->Security Group and make sure you do not already have a group called "Hadoop_cluster_sc"; if you do, delete it.
From your AWS account, under EC2->Network Interface->Create a network interface, create a new network interface, selecting us-east-1a as the subnet. Do the same for all the other possible subnets (us-east-1b, us-east-1c, us-east-1d). After the creation, you can check in the subnet console which subnet has the IP range 172.31.80.*
Once you have found the subnet associated with those addresses, copy its subnet ID. Then open the main.tf file located in the spark-terraform-master/ folder with a text editor and, on lines 41 and 109, replace the subnet_id with your subnet ID (the one with the IP range 172.31.80.*):

subnet_id = "INSERT YOUR SUBNET_ID HERE"

In the same terminal as before (located in the spark-terraform-master/ folder), execute the commands

terraform init
terraform apply

terraform apply will show you the instances that are going to be created; type yes to start the creation.
After a while (wait!) it should print some public DNS names in green; these are the public DNS names of your instances. The command may fail (with an error like "Connection timeout"); this can usually be solved by running terraform destroy and then terraform apply again.

You can now connect via ssh to all your instances with the command

ssh -i <PATH_TO_SPARK_TERRAFORM>/spark-terraform-master/amzkey.pem ubuntu@<PUBLIC DNS>

If for some reason Terraform did not print the DNS names of the nodes, you can find the public DNS of the master as the node s01 in your AWS console.
11. Connect to the master and execute (one by one):

cp Jars/graphframes-0.8.1-spark3.0-s_2.12.jar /opt/spark-3.0.1-bin-hadoop2.7/jars/graphframes-0.8.1-spark3.0-s_2.12.jar
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slaves.sh spark://s01:7077

You are ready to execute GraphComparison! Execute this command on the master

/opt/spark-3.0.1-bin-hadoop2.7/bin/spark-submit --master spark://s01:7077  --executor-cores 4 --executor-memory 14g main.py

Note: if you use a machine with fewer resources, you need to adjust the parameters of this command.

12a. Common errors in this phase.
Depending on the machine you chose, you can change the number of cores used and the amount of RAM allocated for the tasks. If you want to use a dataset other than ENGB, pay attention to the output of this command; if you get this warning message:

WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

This means that you have allocated insufficient resources or that other tasks hold a lock on them. You can check the jobs being executed in the Spark UI at the following address:

<PUBLIC DNS OF YOUR MASTER NODE>:8080

Sometimes an iteration takes much longer than the others. Possible causes are: 1) install-all.sh defines more workers than the real number of workers (e.g. if you are using 2 workers, you need to delete s04, s05 and s06 from lines 166 and 204 of install-all.sh); 2) AWS is throttling the resources of the instances. We usually resolve these problems by destroying the instances and waiting some time before re-creating them.

Remember to run terraform destroy to delete your EC2 instances.

Note: steps 0 to 7 (inclusive) are needed only the first time you run the project.

Table of results

Retrieve nodes that satisfy a particular query.

NoOfWorkers	GraphClass time (s)	GraphFrame time (s)	Dataset
2	0.000423431396484375	37.88420820236206	ENGB
2	0.016211986541748047	27.954243659973145	DE
3	0.0003943443298339844	32.59935522079468	ENGB
3	0.012290716171264648	16.822822332382202	DE
4	0.0004608631134033203	30.52811098098755	ENGB
4	0.012469768524169922	15.371482372283936	DE
5	0.0004055500030517578	36.94925403594971	ENGB
5	0.019950389862060547	15.282100439071655	DE
6	0.0004177093505859375	27.606682062149048	ENGB
6	0.012729883193969727	17.13811993598938	DE

Retrieve the node that has the max value of a given parameter.

NoOfWorkers	GraphClass time (s)	GraphFrame time (s)	Dataset
2	0.0006043910980224609	0.018646240234375	ENGB
2	0.019882917404174805	0.011564970016479492	DE
3	0.0004267692565917969	0.009284496307373047	ENGB
3	0.019281625747680664	0.010223388671875	DE
4	0.0005440711975097656	0.011857748031616211	ENGB
4	0.020158767700195312	0.010314226150512695	DE
5	0.00044655799865722656	0.010093927383422852	ENGB
5	0.021704673767089844	0.03195500373840332	DE
6	0.00043702125549316406	0.011041641235351562	ENGB
6	0.019765615463256836	0.007799386978149414	DE

Retrieve connected components.

NoOfWorkers	GraphClass time (s)	GraphFrame time (s)	Dataset
2	0.01049494743347168	102.73485112190247	ENGB
2	0.04446291923522949	106.41630601882935	DE
3	0.009637594223022461	87.6495943069458	ENGB
3	0.02515435218811035	46.40431880950928	DE
4	0.01388859748840332	83.48309469223022	ENGB
4	0.023485422134399414	41.4247043132782	DE
5	0.009996652603149414	81.32360482215881	ENGB
5	0.025549888610839844	41.889562129974365	DE
6	0.010695457458496094	82.51052212715149	ENGB
6	0.021820545196533203	38.586870431900024	DE

Retrieve strongly connected components.

NoOfWorkers	GraphClass time (s)	GraphFrame time (s)	Dataset
2	0.017906665802001953	15.816995620727539	ENGB
2	0.03655409812927246	96.64561462402344	DE
3	0.018994808197021484	18.190645933151245	ENGB
3	0.06282544136047363	81.6668872833252	DE
4	0.021115779876708984	17.038374185562134	ENGB
4	0.037874460220336914	97.64344906806946	DE
5	0.019264936447143555	19.353548049926758	ENGB
5	0.06413626670837402	93.76303029060364	DE
6	0.018758773803710938	21.704362869262695	ENGB
6	0.04340491092382217	85.273492013944	DE

Count the number of triangles.

NoOfWorkers	GraphClass time (s)	GraphFrame time (s)	Dataset
2	out of time (over 20 minutes)	29.846829891204834	ENGB
2	out of time (over 20 minutes)	38.8481080532074	DE
3	out of time (over 20 minutes)	19.2566659450531	ENGB
3	out of time (over 20 minutes)	26.689303398132324	DE
4	out of time (over 20 minutes)	17.240204334259033	ENGB
4	out of time (over 20 minutes)	24.466190576553345	DE
5	out of time (over 20 minutes)	19.41680598258972	ENGB
5	out of time (over 20 minutes)	24.37866497039795	DE
6	out of time (over 20 minutes)	14.647715330123901	ENGB
6	out of time (over 20 minutes)	15.930915355682373	DE

Retrieve the indegree of every node.

NoOfWorkers	GraphClass time (s)	GraphFrame time (s)	Dataset
2	0.04516458511352539	0.032668787002563477	DE
2	0.03279376029968262	0.008373737335205078	ENGB
3	0.025552749633789062	0.02932000160217285	DE
3	0.04047727584838867	0.007405281066894531	ENGB
4	0.03268170356750488	0.026362895965576172	DE
4	0.03748941421508789	0.007115602493286133	ENGB
5	0.031110286712646484	0.028514862060546875	DE
5	0.04486250877380371	0.007311105728149414	ENGB
6	0.025351762771606445	0.022911901473999023	DE
6	0.06084442138671875	0.0064394474029541016	ENGB

Retrieve the shortest path between two nodes.

NoOfWorkers	GraphClass time (s)	GraphFrame time (s)	Dataset
2	0.10671496391296387	16.647297143936157	DE
2	0.03432583808898926	2.6014223098754883	ENGB
3	0.10739398002624512	14.898767232894897	DE
3	0.0544133186340332	2.318774461746216	ENGB
4	0.10797381401062012	14.46754503250122	DE
4	0.04219651222229004	1.8270833492279053	ENGB
5	0.10746145248413086	13.523574590682983	DE
5	0.059081315994262695	1.786609172821045	ENGB
6	0.10418510437011719	11.544317722320557	DE
6	0.05588173866271973	1.4732680320739746	ENGB
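
One way to read the relationship between worker count and computation time from these tables is to compute the speedup relative to the 2-worker run. The sketch below does this for the GraphFrame timings of the connected-components table on the DE dataset; the values are copied from that table, while the variable names are illustrative.

```python
# (workers, GraphFrame seconds) from the connected-components table, DE dataset
de_connected_components = [
    (2, 106.41630601882935),
    (3, 46.40431880950928),
    (4, 41.4247043132782),
    (5, 41.889562129974365),
    (6, 38.586870431900024),
]

baseline_workers, baseline_time = de_connected_components[0]

# Speedup of each configuration relative to the 2-worker baseline.
speedups = {workers: baseline_time / seconds
            for workers, seconds in de_connected_components}

for workers, speedup in speedups.items():
    print(f"{workers} workers: {speedup:.2f}x the speed of {baseline_workers} workers")
```

The same calculation can be repeated for any other table by swapping in its GraphFrame column.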

References

[1] https://snap.stanford.edu/data/twitch-social-networks.html
[2] https://github.com/conema/spark-terraform
[3] https://github.com/giacoballoccu/spark-terraform
