robmoore / cautious-carnival

Notes for setting up Spark Notebook example

  1. Run the following commands:
    docker pull andypetrella/spark-notebook:0.7.0-scala-2.11.8-spark-2.0.2-hadoop-2.7.2
    docker pull cassandra
    docker run --name my-cassandra -d cassandra:latest # or docker start my-cassandra if already present
    docker run -p 9001:9001 --link my-cassandra andypetrella/spark-notebook:0.7.0-scala-2.11.8-spark-2.0.2-hadoop-2.7.2
    
  2. Access Spark Notebook in a browser on port 9001 of the Docker host (for example, http://localhost:9001)
  3. Drag and drop "Twitter to Cassandra.snb" onto the Spark Notebook files tab
  4. Run docker ps | grep my-cassandra | cut -f1 -d" " to determine the Cassandra instance hostname (the container ID in the first column). Note: If --link is used when running Spark Notebook, the IP address of the Cassandra instance is also available in the environment variables of the Spark Notebook instance. For example, MY_CASSANDRA_PORT_9042_TCP_ADDR, value: 172.17.0.2.
  5. Set up metadata (Edit -> Edit Metadata):
    "customLocalRepo": "/tmp/repo",
    "customDeps": [
      "com.datastax.spark % spark-cassandra-connector_2.11 % 2.0.0-M3",
      "org.apache.bahir % spark-streaming-twitter_2.11 % 2.0.1"
    ],
    "customSparkConf": {
      "spark.cassandra.connection.host": "<cassandra-hostname-here>"
    },
    
  6. Execute cells in notebook
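
The note in step 4 says that --link exposes the Cassandra address through an environment variable inside the Spark Notebook container. The following is a minimal sketch of how that variable name could be derived and read from a shell in that container. The helper names link_addr_var and link_addr are hypothetical, and the naming rule (alias upper-cased, "-" replaced by "_") is an assumption generalized from the MY_CASSANDRA_PORT_9042_TCP_ADDR example above.

```shell
#!/bin/sh
# Build the name of the variable that --link injects for an alias and port.
# Assumption: the alias is upper-cased and "-" becomes "_", matching the
# MY_CASSANDRA_PORT_9042_TCP_ADDR example in step 4.
link_addr_var() {
  alias_upper=$(printf '%s' "$1" | tr 'a-z-' 'A-Z_')
  printf '%s_PORT_%s_TCP_ADDR\n' "$alias_upper" "$2"
}

# Print the linked container's address; exits non-zero if the variable
# is not set (for example, when --link was not used).
link_addr() {
  printenv "$(link_addr_var "$1" "$2")"
}
```

Run inside the Spark Notebook container, link_addr my-cassandra 9042 should print the address to use for spark.cassandra.connection.host in step 5, provided the container was started with --link my-cassandra.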

Reference

Things to try

  1. Run the TwitterPopularTags.scala example from above to test the feed works with the Spark API.
  2. Try using a local installation of Spark instead of the Docker version
  3. Try using Twitter4j directly (see https://github.com/yusuke/twitter4j/blob/master/twitter4j-examples/src/main/java/twitter4j/examples/stream/PrintSampleStream.java)

From last night

  • Use streaming-Twitter stream.snb to get a Twitter stream up and running

For testing

twurl -t -H stream.twitter.com /1.1/statuses/sample.json

Env variables dump

import scala.collection.JavaConverters._

// Print every environment variable visible to the notebook JVM
val environmentVars = System.getenv().asScala
for ((k, v) <- environmentVars) println(s"key: $k, value: $v")

// Print every JVM system property
val properties = System.getProperties().asScala
for ((k, v) <- properties) println(s"key: $k, value: $v")
