theclaymethod / spark-notebook

Interactive Scala REPL in a browser

Spark Notebook

Fork of the amazing scala-notebook, but focused on massive dataset analysis using Apache Spark.

Description

The main intent of this tool is to create reproducible analysis using Scala, Apache Spark and more.

This is achieved through an interactive web-based editor that can combine Scala code, SQL queries, Markdown, or even JavaScript in a collaborative manner.

Spark support comes out of the box and is simply enabled through the implicit variable named sparkContext.
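
For example, a first cell can use it right away (a minimal sketch with made-up data):

// sparkContext is provided implicitly by the notebook, no setup required
val rdd = sparkContext.parallelize(1 to 1000)
rdd.filter(_ % 2 == 0).count()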

Launch

Very quickstart

Long story short, there is a small script that will set everything up and launch the notebook, with no requirement other than a Java environment.

curl https://raw.githubusercontent.com/andypetrella/spark-notebook/spark/run.sh | bash -s dev

Longer story

The Spark Notebook requires a Java(TM) environment (a JVM) as runtime and SBT as the build tool.

You will also need a working Git installation to download and build the code.

Procedure

Download the code

git clone https://github.com/andypetrella/spark-notebook.git
cd spark-notebook

Launch the server

Enter the SBT console by running sbt within the spark-notebook directory:

spark-notebook$ sbt
[warn] Multiple resolvers having different access mechanism configured with same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
[info] Updating {file:/home/noootsab/src/noootsab/spark-notebook/project/}spark-notebook-build...
...
...
...

Then you can head to the server project and run it. You have several options available:

  • --disable_security: disables the Akka secure cookie (helpful in a local environment)
  • --no_browser: prevents the project from opening a page in your browser every time the server is started
> project server
> run --disable_security

Use

Once the server has started, you can head to http://localhost:8899 and you'll see something similar to: Notebook list

From there you can either:

  • create a new notebook
  • launch an existing notebook

In both cases, the notebook will open in a new tab, loaded as a web page.

Note: a notebook is a JSON file containing the layout and analysis blocks, located within the project folder (with the .snb extension). Hence, notebooks can be shared and their history tracked in an SCM like Git.

Features

Use/Reconfigure Spark

Since this project is aimed squarely at Spark, a SparkContext is added to the environment and can be used directly without additional effort.

Example using Spark

Spark starts with a regular/basic configuration. To adapt the embedded Spark to your environment, a reset function is provided. It takes several parameters, but the most important one is lastChanges, which is itself a function that can adapt the SparkConf. This way, we can change the master, the executor memory, a Cassandra sink, or anything else before restarting the context. For more Spark configuration options, see the [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html) documentation.
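
For instance, a minimal sketch (the master URL and memory value are made up for illustration) that changes the master and the executor memory before the restart:

// same reset(lastChanges = ...) call as in the Cassandra example below,
// only with different SparkConf tweaks
reset(lastChanges = _.setMaster("spark://my-master:7077")   // hypothetical master URL
                     .set("spark.executor.memory", "4g"))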

In this example, we reset the SparkContext and add the configuration options needed to use the spark-cassandra-connector:

import org.apache.spark.{Logging, SparkConf}
val cassandraHost:String = "localhost" 
reset(lastChanges= _.set("spark.cassandra.connection.host", cassandraHost))

This makes the Cassandra connector available in the SparkContext. Then you can use it like so:

import com.datastax.spark.connector._
sparkContext.cassandraTable("test_keyspace", "test_column_family")
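
A quick way to sanity-check the connection is to pull a few rows back to the driver (a sketch, reusing the keyspace and table above):

// take a handful of rows and print them on the driver
sparkContext.cassandraTable("test_keyspace", "test_column_family").take(5).foreach(println)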

Using (Spark)SQL

Spark comes with a handy feature: SQL queries can be written instead of Scala boilerplate, with the clear advantage that the resulting DAG is optimized.

The spark-notebook offers Spark SQL support. To access it, we first need to register an RDD as a table:

dataRDD.registerTempTable("data")
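
Here, dataRDD can be any RDD with a schema; a minimal sketch would be an RDD of case classes (assuming the notebook brings the Spark 1.x SQL implicits into scope so the schema can be inferred):

// a tiny, made-up dataset matching the queries below
case class Record(col1: String, col2: String)
val dataRDD = sparkContext.parallelize(Seq(Record("a", "thingy"), Record("b", "other")))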

Then we can play with this data table like so:

:sql select col1 from data where col2 == 'thingy'

This will give access to the result via the resXYZ variable.

This is already helpful, but the resXYZ numbering can change and is not very friendly, so we can also give the result a name:

:sql[col1Var] select col1 from data where col2 == 'thingy'

Now, we can use the variable col1Var which is an RDD that we can manipulate further.
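
For example (a sketch, assuming the Spark 1.x behaviour where the named result is a SchemaRDD of Row objects):

// extract the first column of each Row and keep the distinct values
col1Var.map(_.getString(0)).distinct().collect()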

This is how it looks in the notebook:

Using SQL

Interacting with JavaScript

Showing numbers is good, but great analysis reports should include relevant charts; for that we need JavaScript to manipulate the notebook's DOM.

For that purpose, a notebook can use the Playground abstraction. It allows us to create data in Scala and use it in predefined JavaScript functions (located under observable/src/main/assets/observable/js) or even JavaScript snippets (that is, written straight in the notebook as a Scala String to be sent to the JavaScript interpreter).

The JavaScript function will be called with these parameters:

  • the data observable: the JS function can subscribe to it to receive new data
  • the dom element: so that it can update it with custom behavior
  • an extra object: any additional data, configuration or whatever that comes from the Scala side

Here is how this can be used, with a predefined consoleDir JS function (see here):

Simple Playground

Another example uses the same predefined function to react to newly incoming data (more on this in a further section). The new element here is the use of a Codec to convert a Scala object into the JSON format used on the JS side:

Playground with custom data

Plotting with D3

Plotting with D3.js is rather common now, but it's not always simple; hence a Scala wrapper is provided that bootstraps D3 for you.

These wrappers are D3.svg and D3.linePlot, and they are just proofs of concept for now. The idea is to bring Scala data to D3.js and then write CoffeeScript to interact with it.

For instance, linePlot is used like so:

Using Rickshaw

Note: This is subject to future change because it would be better to use playground for this purpose.

Timeseries with Rickshaw

Plotting timeseries is very common; for this purpose the spark-notebook includes Rickshaw, which quickly produces handsome timeline charts.

Rickshaw is available through the Playground abstraction, plus a dedicated function, rickshawts, for simple needs.

To use it, you are only required to convert/wrap your data points into a dedicated Series object:

def createTss(start:Long, step:Int=60*1000, nb:Int = 100):Seq[Series] = ...
val data = createTss(orig, step, nb)

val p = new Playground(data, List(Script("rickshawts", 
                                         ("renderer" -> "stack")
                                         ~ ("fixed" -> 
                                              ("interval" -> step/1000)
                                              ~ ("max" -> 100)
                                              ~ ("baseInSec" -> orig/1000)
                                           )
                                        )))(seriesCodec)

As you can see, the only real work is to create the timeseries (a Seq[Series]), where Series is a simple wrapper around the following fields (see the sketch right after this list):

  • name
  • color
  • data (a sequence of x and y)
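
A hedged sketch of the shape these fields imply (the actual Series type ships with the notebook's Rickshaw support and may differ in detail):

// one named, colored series of (x, y) points
case class DataPoint(x: Long, y: Double)
case class Series(name: String, color: String, data: Seq[DataPoint])

val cpu = Series("cpu", "#1f77b4",
  (0 until 100).map(i => DataPoint(1400000000L + i * 60, math.random * 100)))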

Also, there are some options to tune the display:

  • the type of renderer (line, stack, ...)
  • if the timeseries will be updated, you can fix the window by supplying the fixed object with:
    • interval (the rate at which data is updated)
    • max (the maximum number of points displayed)
    • the unit used on the X axis (baseInSec)

Here is an example of the kind of result you can expect:

Using Rickshaw

Dynamic update of data and plot using Scala's Future

One of the very cool things inherited from the original scala-notebook is the use of reactive libraries on both sides, server and client, combined with WebSockets. This offers a neat way to show dynamic activity such as streaming data.

We can exploit this reactive support to update the plot wrappers (the Playground instance, actually) in a dynamic manner. If the JS functions are listening for data changes, they can automatically update their output.

The following example shows how a timeseries plotted with Rickshaw can be updated regularly. We use a Scala Future to simulate a server-side process that polls a third-party service:

Update Timeseries Result

The results will be:

Update Timeseries Result
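
In text form, the server-side loop might look roughly like the following sketch; pushData is a hypothetical stand-in for whatever update call the Playground instance exposes, and createTss/orig/step/nb come from the Rickshaw example above:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// hypothetical: forward freshly generated points to the playground
def pushData(update: Seq[Series]): Unit = ???

Future {
  (1 to 10).foreach { tick =>
    Thread.sleep(1000)                                 // simulate polling a third-party service
    pushData(createTss(orig + tick * step, step, nb))  // generate and push the next window
  }
}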

Update Notebook ClassPath

Keeping the libraries you need on the notebook runtime's classpath is usually cumbersome, as it requires updating the server configuration in the SBT definition and restarting the system. That is a pity, because it means a rebuild and a restart, and the change is not contextual to the notebook!

Hence, a dedicated block context, :cp, has been added; it allows us to specify local paths to jars that will become part of the classpath.

:cp /home/noootsab/.m2/repository/joda-time/joda-time/2.4/joda-time-2.4.jar

Or even

:cp 
/tmp/scala-notebook/repo/com/codahale/metrics/metrics-core/3.0.2/metrics-core-3.0.2.jar
/tmp/scala-notebook/repo/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar
/tmp/scala-notebook/repo/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar
/tmp/scala-notebook/repo/joda-time/joda-time/2.3/joda-time-2.3.jar
/tmp/scala-notebook/repo/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar
/tmp/scala-notebook/repo/com/datastax/cassandra/cassandra-driver-core/2.0.4/cassandra-driver-core-2.0.4.jar
/tmp/scala-notebook/repo/org/apache/thrift/libthrift/0.9.1/libthrift-0.9.1.jar
/tmp/scala-notebook/repo/org/apache/httpcomponents/httpcore/4.2.4/httpcore-4.2.4.jar
/tmp/scala-notebook/repo/org/joda/joda-convert/1.2/joda-convert-1.2.jar
/tmp/scala-notebook/repo/org/scala-lang/scala-reflect/2.10.4/scala-reflect-2.10.4.jar
/tmp/scala-notebook/repo/org/apache/cassandra/cassandra-clientutil/2.0.9/cassandra-clientutil-2.0.9.jar
/tmp/scala-notebook/repo/org/slf4j/slf4j-api/1.7.2/slf4j-api-1.7.2.jar
/tmp/scala-notebook/repo/com/datastax/cassandra/cassandra-driver-core/2.0.4/cassandra-driver-core-2.0.4-sources.jar
/tmp/scala-notebook/repo/io/netty/netty/3.9.0.Final/netty-3.9.0.Final.jar
/tmp/scala-notebook/repo/org/apache/commons/commons-lang3/3.3.2/commons-lang3-3.3.2.jar
/tmp/scala-notebook/repo/commons-codec/commons-codec/1.6/commons-codec-1.6.jar
/tmp/scala-notebook/repo/org/apache/httpcomponents/httpclient/4.2.5/httpclient-4.2.5.jar
/tmp/scala-notebook/repo/org/apache/cassandra/cassandra-thrift/2.0.9/cassandra-thrift-2.0.9.jar
/tmp/scala-notebook/repo/com/datastax/spark/spark-cassandra-connector_2.10/1.1.0-alpha1/spark-cassandra-connector_2.10-1.1.0-alpha1.jar
/tmp/scala-notebook/repo/com/google/guava/guava/15.0/guava-15.0.jar

Here is what it'll look like in the notebook:

Simple Classpath

Update Spark jars

If you are a Spark user, you know that it's not enough to add the jars locally to the driver's classpath; the workers also need to load them. The usual way would be to update the list of jars provided to the SparkConf, using the reset function explained above.
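
A hedged sketch of that usual approach, reusing the reset call from above (the jar path is made up):

// ship the jar to the executors through the Spark configuration before restarting the context
reset(lastChanges = _.set("spark.jars", "/path/to/my-dependency.jar"))  // hypothetical path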

However, this can be very tricky when we need to add jars that have themselves plenty of dependencies. This notebook offers an extra feature to facilitate this dependency loading:

resolveAndAddToJars

This function takes three parameters:

  • groupId
  • artifactId
  • version

It creates a local repository and downloads the artifact with all its dependencies. The repository where these libs will be downloaded can be changed using updateRepo. For example, if you want to use the Cassandra Spark connector, all you need to do is:

resolveAndAddToJars("com.datastax.spark", "spark-cassandra-connector_2.10", "1.1.0-alpha3")

/!\ Then update the driver classpath with the listed jars. (This will change!)

In live:

Spark Jars

Note: Aether is used to do this, so if something special is needed that isn't provided yet, we can look at how Aether could enable it!
