Notes for setting up Spark Notebook example

Run the following commands:

docker pull andypetrella/spark-notebook:0.7.0-scala-2.11.8-spark-2.0.2-hadoop-2.7.2
docker pull cassandra
docker run --name my-cassandra -d cassandra:latest # or docker start my-cassandra if already present
docker run -p 9001:9001 --link my-cassandra andypetrella/spark-notebook:0.7.0-scala-2.11.8-spark-2.0.2-hadoop-2.7.2
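
The comment on the Cassandra command (use docker start if the container is already present) can be folded into a single step. A minimal sketch, assuming a POSIX shell; the function name start_cassandra is made up for illustration and is not part of Docker:

```shell
# Start the Cassandra container, creating it only if it does not exist yet.
# start_cassandra is a hypothetical helper name.
start_cassandra() {
  name="${1:-my-cassandra}"
  # List all containers (running or stopped) by name and look for an exact match.
  if docker ps -a --format '{{.Names}}' | grep -Fqx -- "$name"; then
    docker start "$name"
  else
    docker run --name "$name" -d cassandra:latest
  fi
}
```

Calling start_cassandra with no argument targets my-cassandra; pass a different container name as the first argument if needed.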

Access Spark Notebook in a browser on port 9001 of the Docker host
Drag and drop Twitter to Cassandra.snb onto the Spark Notebook files tab
Run docker ps | grep my-cassandra | cut -f1 -d" " to get the Cassandra container ID, which serves as the Cassandra instance hostname. Note: when Spark Notebook is started with --link, the IP address of the Cassandra instance is also available in the environment variables of the Spark Notebook instance, for example MY_CASSANDRA_PORT_9042_TCP_ADDR with value 172.17.0.2.
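
The same lookup can be done with Docker's own filter and format options instead of grep and cut. A sketch, assuming the container is named my-cassandra as above; cassandra_id and cassandra_ip are hypothetical helper names:

```shell
# Print the short container ID, which is the container's default hostname.
# Note: the name filter matches substrings, so other containers whose names
# contain the given string would be listed too.
cassandra_id() {
  docker ps --filter "name=${1:-my-cassandra}" --format '{{.ID}}'
}

# Print the container's IP address on each network it is attached to.
cassandra_ip() {
  docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "${1:-my-cassandra}"
}
```

Either value is a candidate for spark.cassandra.connection.host in the notebook metadata.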

Set up metadata (Edit -> Edit Metadata)

"customLocalRepo": "/tmp/repo",
"customDeps": [
  "com.datastax.spark % spark-cassandra-connector_2.11 % 2.0.0-M3",
  "org.apache.bahir % spark-streaming-twitter_2.11 % 2.0.1"
],
"customSparkConf": {
  "spark.cassandra.connection.host": "<cassandra-hostname-here>"
},

Execute cells in notebook
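
After the cells have run, one way to check that the notebook reached Cassandra is to list the keyspaces from inside the Cassandra container. This is a sketch under assumptions: it relies on the cqlsh client bundled with the cassandra image, the helper name list_keyspaces is hypothetical, and the keyspace the notebook writes to is not stated in these notes, so look for it in the output.

```shell
# List the keyspaces known to the Cassandra container.
# list_keyspaces is a hypothetical helper name; cqlsh -e runs one CQL
# statement and exits.
list_keyspaces() {
  docker exec "${1:-my-cassandra}" cqlsh -e 'DESCRIBE KEYSPACES'
}
```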

Reference

Github repo for Spark Notebook
From O'Reilly class: docker run --rm -it -m 8g --net=host datafellas/distributed-pipeline-quotes:2.0.1 bash
Using Spark Twitter libraries
Example project of sentiment analysis on Tweets using Spark
https://github.com/andypetrella/spark-notebook/blob/master/docs/metadata.md#custom-variables
Also, see configuration example from training-exercices-Assessment 1.snb
Useful info in both training-exercices-Assessment 1.snb and training-exercices-Assessment 2.snb
http://ampcamp.berkeley.edu/3/exercises/realtime-processing-with-spark-streaming.html
http://blog.rackspace.com/behind-the-curtain-twitter-sentiment-analysis-demo-sample-app-code
https://dev.twitter.com/streaming/public
https://github.com/apache/bahir/blob/master/streaming-twitter/examples/src/main/scala/org/apache/spark/examples/streaming/twitter/TwitterPopularTags.scala
https://github.com/apache/bahir/tree/master/streaming-twitter
http://bahir.apache.org/docs/spark/2.0.1/spark-streaming-twitter/
https://gitter.im/andypetrella/spark-notebook

Things to try

Run the TwitterPopularTags.scala example from above to test that the feed works with the Spark API.
Try using a local version of Spark instead of the Docker version
Try using Twitter4j directly (see https://github.com/yusuke/twitter4j/blob/master/twitter4j-examples/src/main/java/twitter4j/examples/stream/PrintSampleStream.java)

From last night

Use streaming-Twitter stream.snb to get a Twitter stream up and running

For testing

twurl -t -H stream.twitter.com /1.1/statuses/sample.json

Env variables dump

import scala.collection.JavaConverters._

// Dump all environment variables visible to the notebook JVM
val environmentVars = System.getenv().asScala
for ((k, v) <- environmentVars) println(s"key: $k, value: $v")

// Dump all JVM system properties
val properties = System.getProperties().asScala
for ((k, v) <- properties) println(s"key: $k, value: $v")
