Copyright (c) 2014 Linus Yang & Tomas Tauber
-
This is for the "Spark Live Demo Section" of the Big Data Analytics in Scala group meeting at http://www.meetup.com/HK-Functional-programming/events/179181332/. Slides available at https://linusyang.github.io/sparkdemo/slides/.
-
This demo was originally created by Nick Pentreath; see http://mlnick.github.io/blog/2013/04/01/movie-recommendations-and-more-with-spark/.
-
For more details on calculating the correlation between movies, which we did not cover in depth at the meeting, please refer to http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/.
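To make the item-item correlation idea concrete, here is a rough sketch in plain Scala (no Spark): for two movies, take the ratings from users who rated both, then compute the Pearson correlation of the two rating vectors. The object name and the sample ratings are hypothetical, not taken from the repository.

```scala
// Plain-Scala sketch of item-item similarity via Pearson correlation.
// In the real demo this runs as a Spark job over the MovieLens data.
object CorrelationSketch {
  // Pearson correlation of two equal-length rating vectors.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    val n = xs.length
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    val cov  = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val stdX = math.sqrt(xs.map(x => (x - meanX) * (x - meanX)).sum)
    val stdY = math.sqrt(ys.map(y => (y - meanY) * (y - meanY)).sum)
    cov / (stdX * stdY)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical ratings for two movies, restricted to the
    // four users who rated both of them.
    val movieA = Seq(5.0, 3.0, 4.0, 1.0)
    val movieB = Seq(4.0, 2.0, 5.0, 1.0)
    println(f"correlation = ${pearson(movieA, movieB)}%.4f") // correlation = 0.8552
  }
}
```

A correlation near 1 means users tend to rate the two movies similarly, so one is a reasonable recommendation for fans of the other; the Spark version computes this for every co-rated movie pair.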
-
The movie data comes from the MovieLens 100k dataset, available at http://grouplens.org/datasets/movielens/.
- Install the latest Spark release on your node. If you need to build the Scala code, also install sbt.
- Set up your Spark cluster following this tutorial (standalone mode is recommended).
- Clone this repository at the same location on every node of your Spark cluster by running `git clone https://github.com/linusyang/sparkdemo.git && cd sparkdemo/`.
- Set up the configuration by editing the `Makefile` (you need the `make` utility first):
  - `SPARK_HOME`: directory where Spark is installed
  - `SPARK_MASTER`: URL (`spark://`) of the Spark master node
  - `SPARK_MEMORY`: memory size used for every Spark worker node
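The three `Makefile` variables might look like the following; the path, hostname, and memory size are illustrative placeholders, not values from the repository (7077 is Spark's default standalone master port):

```makefile
# Illustrative values only -- adjust to your own cluster.
SPARK_HOME   = /usr/local/spark          # where Spark is installed
SPARK_MASTER = spark://master-node:7077  # URL of the Spark master node
SPARK_MEMORY = 2g                        # memory per Spark worker node
```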
- Start the Spark service by running `make up`.
- Run the demo either interactively in a shell with `make`, or in batch mode with `make run`:
  - `make`: in the shell, type `new Worker().run(sc)` or `new Worker().read(sc).pair().calc().show()` to get the result.
  - `make run`: in batch mode, the result is printed directly.
- When you are finished with the demo, use `make down` to shut down all Spark instances.
Licensed under GPLv3.