node2vec
A Spark and Scala implementation of neighborhood sampling and feature learning for large graphs.
This repository includes an implementation of node2vec.
Features
- Scalable node2vec
- Second-order random walk
- Memory-efficient graph data structure introduced at Spark Summit 2017 in the talk "Random Walks on Large Scale Graphs with Apache Spark" by Min Shen (LinkedIn)
- Leverages graph partition information to optimize communication and speed up the random walk computation
- Compatible with Spark-JobServer
Requirements
- Scala 2.11 or later
- Maven 3+
- Java 8+
- Apache Spark 2.2.0 or later
- (Optional) Spark-JobServer 0.8.0 or later
Quick Setup
- Get the random walk application source code:
  - using git: ' git clone git@github.com:data61/stellar-random-walk.git '
  - using http: download the zip file and unzip it.
- Go to the source code directory. A pre-built jar file, named randomwalk-0.0.1-SNAPSHOT.jar, is available at ./target. Use this jar file to run the application. To build the jar file from the source code instead, install Apache Maven and run:
mvn clean package
- Download Apache Spark 2.2.0 or later (e.g., release 2.2.1, pre-built for Apache Hadoop 2.7).
Run the Application Using Spark (local machine)
To run the application on your machine, use the spark-submit script. Go to the Apache Spark directory and run the application with the following command:
bin/spark-submit --class au.csiro.data61.randomwalk.Main [random walk dir]/target/randomwalk-0.0.1-SNAPSHOT.jar
Run the Application Using Spark Job-server
- Make sure that the prerequisites are installed:
  - Apache Spark (e.g., release 2.2.1, pre-built for Apache Hadoop 2.7)
  - Java Virtual Machine (e.g., 9.0.1)
  - sbt
- Clone the Spark Job-server repository with git clone.
- Create a .bashrc with the paths to the JVM, sbt, and Spark. For example, on Mac OS:
export SBT=/usr/local/Cellar/sbt/1.1.0
export SPARK_HOME=~/spark-2.2.1-bin-hadoop2.7
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-9.0.1.jdk
export PATH=$JAVA_HOME/bin:$SBT/bin:$PATH
- Run
source .bashrc
- Go to the Spark folder and run start-all.sh:
cd spark-2.2.1-bin-hadoop2.7/
sbin/start-all.sh
- From the Spark Job-server directory, start the sbt shell:
cd spark-jobserver/job-server/
sbt
- Once inside the sbt shell, run
reStart
- Go to http://localhost:8090/ and make sure that the Spark Job-server UI is working. (Note: in Chrome the binaries were not updated properly, while in Firefox it worked.)
- Upload the randomwalk jar to the server:
curl --data-binary @randomwalk/target/randomwalk-0.0.1-SNAPSHOT.jar localhost:8090/jars/randomwalk
- Submit a job:
curl -d "rw.input = --cmd randomwalk --numWalks 1 --p 1 --q 1 --walkLength 10 --rddPartitions 10 --directed false --input [random walk dir]/src/test/resources/karate.txt --output [output dir] --partitioned false" 'localhost:8090/jobs?appName=randomwalk&classPath=au.csiro.data61.randomwalk.Main'
- Check the status in the Spark Job-server UI.
Application Options
The following options are available. The value after each colon is the default:
--walkLength <value> walkLength: 80
--numWalks <value> numWalks: 10
--p <value> return parameter p: 1.0
--q <value> in-out parameter q: 1.0
--rddPartitions <value> Number of RDD partitions in running Random Walk and Word2vec: 200
--weighted <value> weighted: true
--directed <value> directed: false
--w2vPartitions <value> Number of partitions in word2vec: 10
--input <value> Input edge file path: empty
--output <value> Output path: empty
--cmd <value> command: node2vec
--partitioned <value> Whether the graph is partitioned: false
--lr <value> Learning rate in word2vec: 0.025
--iter <value> Number of iterations in word2vec: 10
--dim <value> Number of dimensions in word2vec: 128
--window <value> Window size in word2vec: 10
For example:
bin/spark-submit --class au.csiro.data61.randomwalk.Main ./randomwalk/target/randomwalk-0.0.1-SNAPSHOT.jar \
--cmd randomwalk --numWalks 1 --p 1 --q 1 --walkLength 10 --rddPartitions 10 \
--input [input edge list] --output [output directory] --partitioned false
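The --p and --q options control the bias of the second-order random walk described in the node2vec paper. The following is a minimal sketch of that bias in plain Scala, not the code used in this repository. The adjacency-map Graph type and the names bias and transitionProbs are assumptions made for illustration only.

```scala
// Sketch of the node2vec second-order bias (Grover & Leskovec, 2016).
// All names are illustrative and are not taken from this repository.
object SecondOrderBias {
  // Assumed representation: vertex -> (neighbor -> edge weight)
  type Graph = Map[Int, Map[Int, Double]]

  /** Unnormalized weight of stepping from `curr` to `next`, having arrived from `prev`. */
  def bias(g: Graph, prev: Int, curr: Int, next: Int, p: Double, q: Double): Double = {
    val w = g(curr)(next)
    if (next == prev) w / p // step back to the previous vertex
    else if (g.getOrElse(prev, Map.empty[Int, Double]).contains(next)) w // next is also a neighbor of prev
    else w / q // step further away from prev
  }

  /** Normalized transition probabilities over the neighbors of `curr`. */
  def transitionProbs(g: Graph, prev: Int, curr: Int, p: Double, q: Double): Map[Int, Double] = {
    val raw = g(curr).keys.map(n => n -> bias(g, prev, curr, n, p, q)).toMap
    val total = raw.values.sum
    raw.map { case (n, b) => n -> b / total }
  }
}
```

With p = 1 and q = 1 (the defaults), every neighbor is weighted only by its edge weight, so the walk reduces to a first-order weighted random walk. A larger p makes returning to the previous vertex less likely, and a smaller q favors moving away from it.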
Graph File Format
The input graph must be an edge list with integer vertex IDs. For example:
src1-id dst1-id
src1-id dst2-id
...
If the graph is weighted, each edge must include its weight in the last column. For example:
src1-id dst1-id 1.0
If the graph is partitioned, each edge should have a partition number, i.e., should be assigned to a partition. The partition number must be in the third column of the edge list. For example:
src1-id dst1-id 1 1.0
src1-id dst2-id 1 1.0
src3-id dst1-id 2 1.0
...
The application itself will replicate (cut) those vertices that span multiple partitions.
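The column layout above can be sketched as a small parser. This is an illustration in plain Scala rather than the repository's own loader: the Edge case class, parseEdge, and cutVertices are hypothetical names, and the sketch assumes whitespace-separated columns, a strict column count, and a weight of 1.0 for unweighted edges.

```scala
// Sketch of the edge list layout: src dst [partition] [weight].
// All names are hypothetical and are not taken from this repository.
object EdgeListFormat {
  final case class Edge(src: Int, dst: Int, partition: Option[Int], weight: Double)

  /** Parses one whitespace-separated edge line. */
  def parseEdge(line: String, weighted: Boolean, partitioned: Boolean): Edge = {
    val cols = line.trim.split("\\s+")
    val expected = 2 + (if (partitioned) 1 else 0) + (if (weighted) 1 else 0)
    require(cols.length == expected,
      s"expected $expected columns but found ${cols.length}: '$line'")
    // The partition number, when present, is the third column.
    val partition = if (partitioned) Some(cols(2).toInt) else None
    // The weight, when present, is the last column; 1.0 is an assumed default.
    val weight = if (weighted) cols.last.toDouble else 1.0
    Edge(cols(0).toInt, cols(1).toInt, partition, weight)
  }

  /** Vertices whose edges fall in more than one partition, with the partitions they touch. */
  def cutVertices(edges: Seq[Edge]): Map[Int, Set[Int]] =
    edges
      .flatMap(e => e.partition.toSeq.flatMap(p => Seq(e.src -> p, e.dst -> p)))
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }
      .filter { case (_, parts) => parts.size > 1 }
}
```

For instance, parsing the lines "1 10 1 1.0", "1 20 1 1.0", and "3 10 2 1.0" with both flags set and passing the result to cutVertices reports vertex 10 as touching partitions 1 and 2, which is the kind of vertex the application would replicate.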
References
- Grover, Aditya, and Jure Leskovec. "node2vec: Scalable feature learning for networks." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
- Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
- Shen, Min. "Random Walks on Large Scale Graphs with Apache Spark." Spark Summit 2017.