
Big Data Analysis with Scala and Spark

Assignments code for Big Data Analysis with Scala and Spark course (Coursera EPFL)

Assignments

Final Grade 100%

  • Week 1: Wikipedia

Your overall score for this assignment is 10.00 out of 10.00

  • Week 2-3: StackOverflow

Your overall score for this assignment is 10.00 out of 10.00

  • Week 4: Time usage

Your overall score for this assignment is 10.00 out of 10.00

Details

Week 2-3: StackOverflow

Using the Spark web UI, we can visualize the event timeline and the DAGs of the jobs.

Extracting vectors

  • Stages 1 and 2: load questions and answers.

  • Stage 3: groupedPostings, scoredPostings, vectorPostings

  • Stage 4: sampleVectors

(Screenshot: Spark UI stages for extracting the vectors)
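
For reference, these steps chain together roughly as in the sketch below. The method names are those of the assignment skeleton; the CSV path and the raw parsing step are assumptions:

val lines   = sc.textFile("src/main/resources/stackoverflow/stackoverflow.csv") // path is an assumption
val raw     = rawPostings(lines)      // stages 1-2: parse the CSV lines into postings (assumed helper)
val grouped = groupedPostings(raw)    // stage 3: pair each question with its answers
val scored  = scoredPostings(grouped) // stage 3: keep the highest answer score per question
val vectors = vectorPostings(scored)  // stage 3: build (language index, score) vectors
val sampled = sampleVectors(vectors)  // stage 4: per-language sample, cached for k-means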

K-Means clustering

  • Jobs 2 to 46 apply the k-means algorithm to the sampleVectors cached in the previous step.

At each iteration, the updated centroids are collected to the driver to evaluate convergence, and the loop stops once convergence is reached.

(Screenshot: Spark UI jobs for the k-means iterations)
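
A minimal, self-contained sketch of that classify/average/collect cycle is shown below; the helper names, types and convergence threshold are illustrative assumptions, not the assignment's exact implementation:

import org.apache.spark.rdd.RDD

// Squared Euclidean distance between two (language index, score) points
def squaredDistance(a: (Int, Int), b: (Int, Int)): Double = {
  val dx = (a._1 - b._1).toDouble
  val dy = (a._2 - b._2).toDouble
  dx * dx + dy * dy
}

// Index of the closest centroid for a given point
def findClosest(p: (Int, Int), means: Array[(Int, Int)]): Int =
  means.indices.minBy(i => squaredDistance(p, means(i)))

// Component-wise average of a cluster's points
def average(ps: Iterable[(Int, Int)]): (Int, Int) = {
  val n = ps.size
  (ps.map(_._1).sum / n, ps.map(_._2).sum / n)
}

@annotation.tailrec
def kmeans(means: Array[(Int, Int)], vectors: RDD[(Int, Int)], eps: Double = 20.0): Array[(Int, Int)] = {
  // One Spark job per iteration: classify every point, average each cluster,
  // then collect the updated centroids back to the driver.
  val updated = vectors
    .map(p => (findClosest(p, means), p))
    .groupByKey()
    .mapValues(average)
    .collect()

  val newMeans = means.clone()
  updated.foreach { case (i, m) => newMeans(i) = m }

  // Total centroid movement decides convergence (the eps threshold is an assumption)
  val movement = means.zip(newMeans).map { case (a, b) => squaredDistance(a, b) }.sum
  if (movement <= eps) newMeans          // converged: stop
  else kmeans(newMeans, vectors, eps)    // otherwise launch the next iteration
}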

Week 4: Time usage

The dataset analyzed originates from the American Time Use Survey (ATUS), 2003-2015, obtained via Kaggle. It measures how people divide their time among various daily activities.

Displaying data with Zeppelin

We load the resulting dataset into Apache Zeppelin.

Install
  • Download the archive from the Zeppelin website (wget)

  • Extract it (tar)

  • Run: SPARK_LOCAL_IP=127.0.0.1 zeppelin-0.7.1/bin/zeppelin-daemon.sh start

  • Stop: zeppelin-0.7.1/bin/zeppelin-daemon.sh stop

nb: SPARK_LOCAL_IP is set to work around a port binding exception in Zeppelin 0.7.1

Prepare data export

Export the resulting week 4 dataset as JSON.

1) From the Spark environment, export data to disk:

finalDf.coalesce(1) // (1)
  .write.json("dataset-week4.json")
  1. Coalesce to a single partition to obtain only one output file (otherwise one per partition)

2) Upload it to the host running Zeppelin, or fetch it from that host (a %sh paragraph, then wget …)

Zeppelin

Connect to the Zeppelin web UI on http://localhost:8080, and create a new notebook with the following content.

// First paragraph (Scala interpreter); Zeppelin already binds sc and sqlContext
val sqlData = sqlContext.read.json("dataset-week4.json")
sqlData.createOrReplaceTempView("data")

// Second paragraph, using the SQL interpreter
%sql SELECT * FROM data ORDER BY work DESC

Display as bar graph:

(Screenshot: Zeppelin bar graph of the query results)

nb: the sort order does not seem to be respected in the chart, as per open issue ZEPPELIN-87
