masonleon / largescale-spark-graph-analytics

Group 10 Project, Fall 2020, CS 6240: Large-Scale Parallel Data Processing, Khoury College of Computer Sciences, Northeastern University


CS6240 Group 10 Project

Fall 2020

Visit https://masonleon.github.io/largescale-spark-graph-analytics/ for additional project information.

Code authors

April Gustafson, Mason Leon, Matthew Sobkowski

Installation

The following components must be installed (a quick version check is sketched after the list):

  • OpenJDK 1.8.0_265
  • Scala 2.11.12
  • Hadoop 2.9.1
  • Spark 2.3.1 (without bundled Hadoop)
  • Maven 3.6.3
  • AWS CLI (for EMR execution)
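
To verify the toolchain is available, each component can be asked for its version. These are the standard version commands for these tools; adjust if your installation differs:

```
java -version
scala -version
hadoop version
spark-submit --version
mvn -version
aws --version
```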

Dataset

https://snap.stanford.edu/data/soc-LiveJournal1.html

To download the dataset to the input directory:

```
bash ./data-download.sh
```
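
For reference, a minimal sketch of what such a download script might look like, assuming an input/ target directory and the soc-LiveJournal1.txt.gz archive name (both assumptions; the repository's data-download.sh is the authoritative version):

```
#!/usr/bin/env bash
# hypothetical sketch: fetch and unpack the LiveJournal social network graph
mkdir -p input
wget -P input https://snap.stanford.edu/data/soc-LiveJournal1.txt.gz
gunzip input/soc-LiveJournal1.txt.gz
```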

Environment

  1. Example ~/.bash_aliases (a quick sanity check of these settings is sketched after this list):

    # adjust the install prefix (/path/to/tools) to match your environment
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    export HADOOP_HOME=/path/to/tools/hadoop/hadoop-2.9.1
    export SCALA_HOME=/path/to/tools/scala/scala-2.11.12
    export SPARK_HOME=/path/to/tools/spark/spark-2.3.1-bin-without-hadoop
    export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$SPARK_HOME/bin
    export SPARK_DIST_CLASSPATH=$(hadoop classpath)
    
  2. Explicitly set JAVA_HOME in $HADOOP_HOME/etc/hadoop/hadoop-env.sh:

     export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 
    
  3. [Optional] Set up a Docker environment:
    https://docs.docker.com/get-docker/
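
After sourcing the aliases from step 1 (e.g., `source ~/.bash_aliases`), a quick sanity check that the environment resolves correctly, assuming the paths above point at real installations:

```
source ~/.bash_aliases
echo $HADOOP_HOME $SCALA_HOME $SPARK_HOME   # should print the install paths, not empty strings
hadoop classpath                            # should list the Hadoop jars
echo $SPARK_DIST_CLASSPATH                  # should match the hadoop classpath output
```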

Execution

All of the build & execution commands are organized in the Makefile.

  1. Unzip the project file.
  2. Open a command prompt.
  3. Navigate to the directory where the project files were unzipped.
  4. Edit the environment settings at the top of the Makefile. For standalone execution it is sufficient to set hadoop.root, jar.name, and local.input; the other defaults are acceptable. (A worked standalone run is sketched after this list.)
  5. Standalone Hadoop:
    make switch-standalone -- set standalone Hadoop environment (execute once)
    make local
  6. Pseudo-Distributed Hadoop: (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation)
    make switch-pseudo -- set pseudo-clustered Hadoop environment (execute once)
    make pseudo -- first execution
    make pseudoq -- later executions since namenode and datanode already running
  7. AWS EMR Hadoop: (you must configure the emr.* parameters at the top of the Makefile)
    make upload-input-aws -- only before first execution
    make aws -- check for successful execution with web interface (aws.amazon.com)
    make download-output-aws -- after successful execution & termination
  8. Docker Jupyter Scala/Spark Almond Notebook: (https://github.com/almond-sh/almond)
    make run-container-spark-jupyter-almond -- run docker container with scala + spark kernel for local standalone
    copy token from terminal and paste in browser: http://127.0.0.1:8888/?token=<TOKEN_FROM_TERMINAL>
  9. Docker Standalone Hadoop/Spark:
    make run-container-spark-jar-local -- run docker container environment with compiled .jar app
    make run-container-spark-jar-local 2>&1 | tee logs/logfile.log -- run docker container environment with compiled .jar app and redirect standard error+output to log
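
As a worked example, a typical end-to-end standalone run using the targets above, assuming the Makefile settings from step 4 (hadoop.root, jar.name, local.input) are already set for your environment:

```
bash ./data-download.sh       # download the dataset to the input dir (see Dataset section)
make switch-standalone        # one-time: set the standalone Hadoop environment (step 5)
make local                    # run the job on standalone Hadoop (step 5)
make run-container-spark-jar-local 2>&1 | tee logs/logfile.log   # optional Docker run with logged output (step 9)
```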


License: Apache License 2.0


Languages

Scala 68.5%, Makefile 16.3%, Dockerfile 13.1%, Shell 2.1%