Analysis of the community friendliness of programming languages using GitHub and StackOverflow data
The following data was downloaded into HDFS as part of the ingestion process:

GitHub (GHTorrent)
Size: 102 GB
Source: https://ghtorrent.org/
HDFS paths, starting from the root directory of user svt258:
Raw: project/data/raw/data
Cleaned: project/data/cleaned
Profiling: project/data/stats
Analytics: project/data/analysis
StackOverflow (Stack Exchange data dump)
Size: 120 GB
Source: https://archive.org/details/stackexchange
HDFS paths, starting from the root directory of user rhn235:
Raw: project/data/raw/data
Cleaned: project/data/cleaned
Profiling: project/data/stats
Analytics: project/data/analysis
.
├── build.sbt
├── data
│   ├── github_final_metrics.csv
│   └── stackoverflow_final_metrics.csv
├── project
│   ├── build.properties
│   └── target
│       ├── config-classes
│       ├── scala-2.12
│       └── streams
├── README.md
├── screenshots
├── src
│   └── main
│       └── scala
│           ├── app_code
│           │   ├── analysis_github.scala
│           │   └── analysis_stackoverflow.scala
│           ├── data_ingest
│           │   └── ingest.txt
│           ├── etl_code
│           │   ├── etl_github.scala
│           │   └── etl_stackoverflow.scala
│           ├── profiling_code
│           │   ├── profile_github.scala
│           │   └── profile_stackoverflow.scala
│           └── test_code
│               └── test.scala
└── target
    ├── scala-2.11
    │   ├── big-data-pl_2.11-1.0.jar
    │   ├── classes
    │   └── update
    └── streams
The data folder contains the final computed metrics for both GitHub and StackOverflow. The compiled jar file can be found at target/scala-2.11/.
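The column layout of the two final metrics CSVs is not documented here, but combining the per-language scores from both sources could look like the following pure-Scala sketch. The `language,score` layout and the averaging rule are illustrative assumptions, and the rows are inline samples rather than the real files:

```scala
object MergeMetricsSketch {
  // Hypothetical CSV layout: language,score — the real files in data/ may differ.
  def parse(csv: String): Map[String, Double] =
    csv.split("\n").drop(1).map { line =>
      val Array(lang, score) = line.split(",").map(_.trim)
      lang -> score.toDouble
    }.toMap

  def main(args: Array[String]): Unit = {
    val github        = parse("language,score\nScala,0.8\nPython,0.9")
    val stackoverflow = parse("language,score\nScala,0.7\nPython,0.95")
    // Average the two scores for languages present in both sources.
    val combined = github.flatMap { case (lang, g) =>
      stackoverflow.get(lang).map(s => lang -> (g + s) / 2.0)
    }
    combined.toSeq.sortBy(-_._2).foreach { case (l, v) =>
      println(f"$l%-10s $v%.3f")
    }
  }
}
```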
Starting at the root of the project folder, perform the following steps to run the application.
module load spark/2.4.0
module load git/1.8.3.1
module load sbt/1.2.8
module load scala/2.11.8
sbt compile
Once finished, the compiled classes will be saved to the target/scala-2.11/classes directory.
sbt package
Once finished, a compiled jar file named big-data-pl_2.11-1.0.jar will be saved to the target/scala-2.11/ directory.
spark2-submit --name "<your job name>" --class <your main class> --master yarn --deploy-mode cluster --verbose --driver-memory 5G --executor-memory 2G --num-executors 10 target/scala-2.11/<your JAR>.jar
spark2-shell --name "<your job name>" --master yarn --deploy-mode client --verbose --driver-memory 5G --executor-memory 2G --num-executors 20
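Inside spark2-shell, the intermediate outputs can be inspected interactively. The on-disk format of the cleaned data is an assumption in this sketch (swap spark.read.parquet for spark.read.csv if the ETL writes CSV):

```scala
// Quick ad-hoc inspection of the cleaned data from the spark2-shell prompt.
// `spark` is the SparkSession predefined by the shell in Spark 2.x.
val cleaned = spark.read.parquet("project/data/cleaned") // format is an assumption
cleaned.printSchema()
cleaned.show(10, truncate = false)
println(s"rows: ${cleaned.count()}")
```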
spark2-submit --name "TEST" --class test.Test --master yarn --deploy-mode cluster --verbose --driver-memory 5G --executor-memory 2G --num-executors 10 target/scala-2.11/big-data-pl_2.11-1.0.jar
A trivial job used only to verify that spark-submit is working.
spark2-submit --name "GH_ETL" --class etl.TransformGithubRaw --master yarn --deploy-mode cluster --verbose --driver-memory 5G --executor-memory 2G --num-executors 20 target/scala-2.11/big-data-pl_2.11-1.0.jar
Reads data from the Raw data path in HDFS (as listed above) and stores cleaned data into the Cleaned data path.
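The body of etl_github.scala is not reproduced here, but an entry point wired up the way the GH_ETL submit command expects (package etl, object TransformGithubRaw) follows this general shape. The input format, column handling, and cleaning rule below are illustrative placeholders, not the project's actual logic:

```scala
package etl

import org.apache.spark.sql.SparkSession

// Sketch of the ETL entry point invoked by the GH_ETL spark2-submit command.
object TransformGithubRaw {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("GH_ETL").getOrCreate()
    val raw = spark.read
      .option("header", "true")
      .csv("project/data/raw/data")        // Raw path listed above
    val cleaned = raw.na.drop()            // placeholder rule: drop incomplete rows
    cleaned.write.mode("overwrite")
      .parquet("project/data/cleaned")     // Cleaned path listed above
    spark.stop()
  }
}
```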
spark2-submit --name "GH_PROFILE" --class profile.ProfileGithub --master yarn --deploy-mode cluster --verbose --driver-memory 5G --executor-memory 2G --num-executors 20 target/scala-2.11/big-data-pl_2.11-1.0.jar
Reads data from the Cleaned data path in HDFS (as listed above) and stores tables generated with profiling info into the Profiling data path.
spark2-submit --name "GH_ANALYZE" --class analysis.AnalyzeGithub --master yarn --deploy-mode cluster --verbose --driver-memory 5G --executor-memory 2G --num-executors 20 target/scala-2.11/big-data-pl_2.11-1.0.jar
Reads data from the Cleaned data path in HDFS (as listed above) and stores the computed metrics data into the Analytics data path.
spark2-submit --packages com.databricks:spark-xml_2.11:0.9.0 --name "SO_ETL" --class etl.TransformStackOverflowRaw --master yarn --deploy-mode cluster --verbose --driver-memory 5G --executor-memory 2G --num-executors 20 target/scala-2.11/big-data-pl_2.11-1.0.jar
Reads data from the Raw data path in HDFS (as listed above) and stores cleaned data into the Cleaned data path.
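The StackOverflow jobs pass --packages com.databricks:spark-xml_2.11:0.9.0 because the Stack Exchange dump ships as XML, which spark-xml can load as a DataFrame. A sketch of how that read might look (the Posts.xml file name and the exact write format are assumptions about this project's layout):

```scala
package etl

import org.apache.spark.sql.SparkSession

// Sketch of the SO ETL entry point; spark-xml must be on the classpath,
// which is what the --packages flag on the submit command provides.
object TransformStackOverflowRaw {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SO_ETL").getOrCreate()
    val posts = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "row")             // each post in the dump is a <row .../> element
      .load("project/data/raw/data/Posts.xml")
    posts.write.mode("overwrite").parquet("project/data/cleaned")
    spark.stop()
  }
}
```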
spark2-submit --packages com.databricks:spark-xml_2.11:0.9.0 --name "SO_PROFILE" --class etl.ProfileStackOverflow --master yarn --deploy-mode cluster --verbose --driver-memory 5G --executor-memory 2G --num-executors 20 target/scala-2.11/big-data-pl_2.11-1.0.jar
Reads data from the Cleaned data path in HDFS (as listed above) and stores tables generated with profiling info into the Profiling data path.
spark2-submit --packages com.databricks:spark-xml_2.11:0.9.0 --name "SO_ANALYZE" --class etl.AnalyzeStackOverflow --master yarn --deploy-mode cluster --verbose --driver-memory 5G --executor-memory 2G --num-executors 20 target/scala-2.11/big-data-pl_2.11-1.0.jar
Reads data from the Cleaned data path in HDFS (as listed above) and stores the computed metrics data into the Analytics data path.
Scala: 2.11.8
Spark: 2.3.0.cloudera4
Java: Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162
Cluster UI: http://babar.es.its.nyu.edu:8088/cluster