Copyright (c) 2014 Linus Yang & Tomas Tauber
-
This is for the "Spark Live Demo Section" of the Big Data Analytics in Scala group meeting at http://www.meetup.com/HK-Functional-programming/events/179181332/. Slides available at https://linusyang.github.io/sparkdemo/slides/.
-
This demo was originally created by Nick Pentreath; see http://mlnick.github.io/blog/2013/04/01/movie-recommendations-and-more-with-spark/.
-
For more details on calculating the correlation between movies, which we did not cover in depth at the meeting, please refer to http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/.
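To make the item-item correlation idea concrete, here is a rough sketch in plain Scala (no Spark): for two movies, take the ratings from users who rated both, then compute the Pearson correlation of the two rating vectors. The object name and the sample ratings are hypothetical, not taken from the repository.

```scala
// Plain-Scala sketch of item-item similarity via Pearson correlation.
// In the real demo this runs as a Spark job over the MovieLens data.
object CorrelationSketch {
  // Pearson correlation of two equal-length rating vectors.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    val n = xs.length
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    val cov  = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val stdX = math.sqrt(xs.map(x => (x - meanX) * (x - meanX)).sum)
    val stdY = math.sqrt(ys.map(y => (y - meanY) * (y - meanY)).sum)
    cov / (stdX * stdY)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical ratings for two movies, restricted to the
    // four users who rated both of them.
    val movieA = Seq(5.0, 3.0, 4.0, 1.0)
    val movieB = Seq(4.0, 2.0, 5.0, 1.0)
    println(f"correlation = ${pearson(movieA, movieB)}%.4f") // correlation = 0.8552
  }
}
```

A correlation near 1 means users tend to rate the two movies similarly, so one is a reasonable recommendation for fans of the other; the Spark version computes this for every co-rated movie pair.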
-
The movie data comes from the MovieLens 100k dataset, available at http://grouplens.org/datasets/movielens/.
- Install the latest Spark release on your node. If you need to build the Scala code, also install sbt.
- Set up your Spark cluster following this tutorial (standalone mode is recommended).
- Clone this repository at the same location on every node of your Spark cluster by running `git clone https://github.com/linusyang/sparkdemo.git && cd sparkdemo/`.
- Set up the configuration by editing the `Makefile` (you need the `make` utility first):
  - `SPARK_HOME`: directory where Spark is installed
  - `SPARK_MASTER`: URL (`spark://`) of the Spark master node
  - `SPARK_MEMORY`: memory size used for every Spark worker node
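The three `Makefile` variables might look like the following; the path, hostname, and memory size are illustrative placeholders, not values from the repository (7077 is Spark's default standalone master port):

```makefile
# Illustrative values only -- adjust to your own cluster.
SPARK_HOME   = /usr/local/spark          # where Spark is installed
SPARK_MASTER = spark://master-node:7077  # URL of the Spark master node
SPARK_MEMORY = 2g                        # memory per Spark worker node
```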
- Start the Spark service by running `make up`.
- Run the demo either interactively in a shell with `make`, or in batch mode with `make run`:
  - `make`: in the shell, type `new Worker().run(sc)` or `new Worker().read(sc).pair().calc().show()` to get the result.
  - `make run`: in batch mode, the result is printed directly.
- When you are finished with the demo, use `make down` to shut down all Spark instances.
Licensed under GPLv3.