This is a simple word count job written in Scala for the Spark cluster computing platform. This project is inspired by Snowplow Analytics's Spark Example Project and Typesafe Activator Spark Sample Project.
Assuming you already have [SBT] [sbt] installed:
$ git clone git://github.com/soulmachine/spark-example-project.git
$ cd spark-example-project
$ sbt clean package
The 'fat jar' is now available as:
target/spark-example-project-1.0.jar
Building with `sbt assembly` runs the test suite as part of the build, but you can also run the tests manually with:
$ sbt test
<snip>
[info] + A WordCount job should
[info] + count words correctly
[info] Passed: : Total 1, Failed 0, Errors 0, Passed 1, Skipped 0
Run `sbt gen-idea` to create IntelliJ IDEA project files, then click `File->Open...` and select the project's root folder. After that you're all set.
spark-submit --class me.soulmachine.spark.WordCount --master spark://ip-or-host:7077 ./spark-example-project-1.0.jar wordcount-test/input wordcount-test/outputs
Code rarely works the first time, so debugging is essential. To debug your Spark program locally, make the following two changes:
- Change `val sparkDependencyScope = "provided"` to `val sparkDependencyScope = "compile"` in `build.sbt`.

  When you run your Spark program on a real Spark cluster, the Spark-related jars are provided by the cluster. When you run it locally, your machine doesn't have the Spark-related jars, so you need to change `provided` to `compile`.

- Append `.setMaster("local[2]")` to your `SparkConf`.

  When you run your Spark program locally, you don't have a Spark master, so you need to run it in local mode by using the string `"local"` as the master. The `[2]` indicates that 2 threads should be used.
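The second step can be sketched roughly as follows. This is a minimal illustration, not the project's actual source: the object name `LocalWordCount` and the helper `tokenize` are hypothetical, and the sketch assumes Spark 1.3 or later (so that `reduceByKey` is available without extra imports) and that `sparkDependencyScope` has already been switched to `"compile"`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object for illustration; the project's real entry point
// is me.soulmachine.spark.WordCount and may be structured differently.
object LocalWordCount {

  // Pure helper: split a line on whitespace and drop empty tokens.
  def tokenize(line: String): Seq[String] =
    line.split("\\s+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: LocalWordCount <input> <output>")
      sys.exit(1)
    }

    val conf = new SparkConf()
      .setAppName("LocalWordCount")
      .setMaster("local[2]") // local mode with 2 threads, for debugging only

    val sc = new SparkContext(conf)
    try {
      sc.textFile(args(0))                 // read the input as lines
        .flatMap(line => tokenize(line))   // one record per word
        .map(word => (word, 1))            // pair each word with a count of 1
        .reduceByKey(_ + _)                // sum the counts per word
        .saveAsTextFile(args(1))           // write (word, count) pairs
    } finally {
      sc.stop()
    }
  }
}
```

A master set in code takes precedence over the `--master` flag passed to `spark-submit`, so the `.setMaster("local[2]")` call should be removed again before the job is submitted to a real cluster.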
linter (https://github.com/HairyFotr/linter)
Usage: runs automatically during compilation and during evaluation in the console
sbt-scapegoat (https://github.com/sksamuel/sbt-scapegoat)
Usage: runs automatically during compilation
Open target/scala-2.11/scapegoat.xml or target/scala-2.11/scapegoat.html
scalastyle
Usage: run `sbt scalastyle`
Open target/scalastyle-result.xml
All check levels are set to "warn"; change them to "error" if you want code changes that violate a rule to be rejected when the checks are integrated with CI tools.