A simple project intended to demo Spark and get developers up and running quickly.
Note: This project uses Gradle. You must install Gradle (1.12). If you would rather not install Gradle locally, you can use the Gradle Wrapper by replacing all references to `gradle` with `gradlew`.
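For example, assuming the wrapper scripts are checked in at the project root, the build step below becomes `./gradlew build` instead of `gradle build`.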
- Execute `gradle build`
- Find the artifact jars in './build/libs/'
- Execute `gradle idea`
- Open the project folder in IntelliJ or open the generated .ipr file
Note: If you have any issues in IntelliJ, a good first troubleshooting step is to execute `gradle cleanIdea idea`
- Execute `gradle eclipse`
- Open the project folder in Eclipse
Note: If you have any issues in Eclipse, a good first troubleshooting step is to execute `gradle cleanEclipse eclipse`
Note: This guide has only been tested on Mac OS X and may assume tools specific to it. If you are working on another OS, substitutes may be needed but should be readily available.
- Run `gradle build`
The demos generally take the first argument as the Spark Master URL. Setting this value to 'local' runs the demo in local mode. A trailing number in brackets, '[#]', indicates the number of cores to use (e.g. 'local[2]' runs locally with 2 cores).
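For reference, the master URL forms that appear in this guide are listed below (the host and port in the last form are placeholders):

```
local               # run in-process on a single core
local[2]            # run in-process using 2 cores
spark://host:7077   # connect to a standalone Spark master (pseudo-distributed/cluster mode)
```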
This project has a Gradle task called 'runSpark' that manages the runtime classpath for you. This simplifies running spark jobs, ensures the same classpath is used in all modes, and shortens the development feedback loop.
The 'runSpark' Gradle task takes two arguments '-PsparkMain' and '-PsparkArgs':
- -PsparkMain: The main class to run.
- -PsparkArgs: The arguments to pass to the main class. See the class for documentation on what arguments are expected.
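Putting the two together, an invocation has the following general shape (the placeholders are illustrative):

```
gradle runSpark -PsparkMain="<fully.qualified.MainClass>" -PsparkArgs="<spark-master-url> <demo-specific args>"
```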
Below are some sample commands for some simple demos:
- SparkPi: `gradle runSpark -PsparkMain="com.cloudera.sa.SparkPi" -PskipHadoopJar -PsparkArgs="local[2] 100"`
- Sessionize: `gradle runSpark -PsparkMain="com.cloudera.sa.Sessionize" -PskipHadoopJar -PsparkArgs="local[2]"`
- HdfsWordCount (see the input note after this list): `gradle runSpark -PsparkMain="com.cloudera.sa.HdfsWordCount" -PskipHadoopJar -PsparkArgs="local[2] streaming-input"`
- NetworkWordCount (see the input note after this list): `gradle runSpark -PsparkMain="com.cloudera.sa.NetworkWordCount" -PskipHadoopJar -PsparkArgs="local[2] localhost 9999"`
Note: The remaining steps are only required for running demos in "pseudo-distributed" mode and on a cluster.
- Install Spark 1.0 using Homebrew: `brew install apache-spark`
- Add SPARK_HOME to your .bash_profile: `export SPARK_HOME=/usr/local/Cellar/apache-spark/1.0.0/libexec`
- Add SCALA_HOME and JAVA_HOME to your .bash_profile
Note: You may also install Spark on your own by following the Spark Documentation
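A minimal .bash_profile sketch covering all three variables (the Scala and Spark paths are examples; adjust them to your local installs):

```
export SPARK_HOME=/usr/local/Cellar/apache-spark/1.0.0/libexec
export SCALA_HOME=/usr/local/opt/scala      # example Homebrew location; point at your Scala install
export JAVA_HOME=$(/usr/libexec/java_home)  # locates the active JDK on Mac OS X
```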
- The defaults should work for now. However, see the Cluster Launch Scripts documentation for more information on configuring your pseudo cluster (and the spark-env.sh sketch after this list).
- Start your Spark cluster: `$SPARK_HOME/sbin/start-all.sh`
- Validate that the master and worker are running in the Spark Master WebUI (typically at http://localhost:8080)
- Note the master URL shown in the Spark Master WebUI. It will be used when submitting jobs.
- Shut down when done: `$SPARK_HOME/sbin/stop-all.sh`
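If you do want to tweak the pseudo cluster, the standalone launch scripts read $SPARK_HOME/conf/spark-env.sh. A small sketch, with illustrative values only:

```
# $SPARK_HOME/conf/spark-env.sh
export SPARK_WORKER_CORES=2       # cores each worker may use
export SPARK_WORKER_MEMORY=2g     # memory each worker may use
export SPARK_WORKER_INSTANCES=1   # worker processes per machine
```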
Running in pseudo-distributed mode is almost exactly the same as running in local mode. Note: Please see Step 2 before continuing.
To run in pseudo-distributed mode, just replace 'local[#]' in the Spark Master URL argument with the URL from Step 4.
Below are some sample commands for each demo:
Note: You will need to substitute in your Spark Master URL
- SparkPi: `gradle runSpark -PsparkMain="com.cloudera.sa.SparkPi" -PsparkArgs="spark://example:7077 100"`
- Sessionize: `gradle runSpark -PsparkMain="com.cloudera.sa.Sessionize" -PsparkArgs="spark://example:7077"`
- HdfsWordCount: `gradle runSpark -PsparkMain="com.cloudera.sa.HdfsWordCount" -PsparkArgs="spark://example:7077 streaming-input"`
- NetworkWordCount: `gradle runSpark -PsparkMain="com.cloudera.sa.NetworkWordCount" -PsparkArgs="spark://example:7077 localhost 9999"`
The build creates a fat jar tagged with '-hadoop' that contains all dependencies needed to run on the cluster. The jar can be found in './build/libs/'.
TODO: Test this and fill out steps.
Develop demos of your own and send a pull request!
- Create a trait/class with a generic context, smart defaults, and unified arg parsing (see the spark-submit script for reference)
- Document what's demonstrated in each demo (Avro, Parquet, Kryo, etc.) and its usage
- Add module-level READMEs and docs
- Tune the logging output configuration (redirect verbose logs into a rolling file)
- Speed up the HadoopJar task (runSpark will follow)
- Kafka Consuming/Producing
- Live upgrade of a streaming application. See Streaming Deployment