tao-pr / thundercats

:zap::smiley_cat: Full-stack machine learning / ETL in Functional programming flavour with Spark

Thundercats

Write Spark in a functional way.


Motivation

What if we could write Spark in the following monadic way?

val p = for {
  a <- IO.Read.parquet("foo.parquet")
  b <- IO.Read.csv("bar.csv", header=true)
  c <- IO.Read.kafka("topic", limit=1000)
  d <- Join.left(a, b, "col1")
  _ <- IO.Write.parquet(d, "new_foo.parquet")
  e <- Group.agg(d, by="col2", sum("col3").as("t"), avg("col4").as("v"))
} yield e

val q = for {
  a <- p
  _ <- IO.Write.kafka("topic", a)
  b <- Filter(a, $("col1") > 35)
} yield b.cache

Supported Data Sources

  • Physical file types: CSV, Parquet
  • Streaming sources: Kafka
  • MongoDB
  • Amazon DynamoDB

Prerequisites

  • Scala 2.12
  • sbt 1.3.3
  • JVM 8
  • Hadoop 3.2
  • Spark 3.0
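
For reference, a minimal build.sbt sketch matching these versions could look like the following. The exact module list is an assumption; the repository's own build.sbt is authoritative.

// Sketch only: versions chosen to match the prerequisites above;
// the actual Thundercats build may declare different modules.
scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"           % "3.0.0" % "provided",
  "org.apache.spark" %% "spark-sql"            % "3.0.0" % "provided",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.0.0"
)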

Workaround: use Java 8 instead of Java 13 (compatibility issue)

Spark still has a known issue with Java 13, so it is recommended to install and activate Java 8 on the machine. On macOS, you can do the following. Note that this installation is slow.

$ brew tap AdoptOpenJDK/openjdk
$ brew cask install adoptopenjdk/openjdk/adoptopenjdk8

Then activate Java 8:

$ export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)

To switch back to Java 13:

$ export JAVA_HOME=$(/usr/libexec/java_home -v 13)

Note that you can list all installed Java versions with:

$ ls -l /Library/Java/JavaVirtualMachines

Tests

The following dependencies are required to run the test suite:

  • Hadoop
  • Docker

Run the full unit test suite:

./run-test.sh

The script starts instances of the required components, e.g. DynamoDB, with Docker Compose and then runs the unit tests as usual with sbt test. After all tests have finished, the script also stops the Docker containers for you.


Usage

Add Thundercats to your project and import the following:

import com.tao.thundercats.physical._

Read dataframes from files

Currently CSV and Parquet files are supported; reading from MongoDB is also shown below.

val df = for {
  a <- Read.csv("path/to/file.csv", withHeader=false, delimiter=";")
  b <- Read.csv("path/to/file.csv")
  c <- Read.parquet("path/to/file.parquet")
  d <- Read.mongo("localhost", "db1", "collection1")
} yield ???
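
For comparison, the CSV and Parquet reads above correspond to the following plain Spark calls (standard Spark API, not Thundercats, assuming a SparkSession named spark is in scope).

// Plain Spark equivalents of the CSV and Parquet reads above, for comparison only
val a = spark.read
  .option("header", "false")
  .option("delimiter", ";")
  .csv("path/to/file.csv")

val c = spark.read.parquet("path/to/file.parquet")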

Write to files

for {
  ...
  a <- Write.csv(df, "path/to/file.csv", withHeader=true, delimiter="\t")
  b <- Write.csv(df, "path/to/file.csv")
  c <- Write.parquet(df, "path/to/file.parquet")
} yield ???

Read and write Kafka

for {
  ...
  a <- Read.kafkaStream("topic", "server-address", 9092) // Stream
  b <- Read.kafka("topic", "server-address", 9092) // Batch
  c <- Read.kafka("topic", "server-address", colEncoder=ColumnEncoder.Avro(schemaStr))
  ...
  _ <- Write.kafkaStream(dfStream, "topic", "server-address", 9092)
  _ <- Write.kafka(dfBatch, "topic", "server-address", 9092, ColumnEncoder.Avro(schemaStr))
  _ <- Write.kafka(dfBatch, "topic", colEncoder=ColumnEncoder.Avro(schemaStr))
  _ <- Write.kafka(dfBatch, "topic")

  ...
} yield ???
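
For comparison, a batch read from Kafka with the plain spark-sql-kafka API (standard Spark, not Thundercats, assuming a SparkSession named spark) looks like this.

// Plain Spark batch read from Kafka, for comparison only
val batch = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "server-address:9092")
  .option("subscribe", "topic")
  .load()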

Read from MongoDB

for {
  ...
  a <- Read.mongo("127.0.0.1", "db-name", "collection-name")
  ...
} yield ???

Show streaming dataframe

for {
  ...
  a <- Read.kafkaStream("topic", "server-address", 9092) // Stream
  _ <- Screen.showDFStream(a, title=Some("Streaming dataframe"))
} yield ???

Show normal dataframe (bounded)

for {
  ...
  a <- Read.csv("path.csv")
  _ <- Screen.showDF(a)
} yield ???

Join, filter, groupby

for {
  ...
  a <- Read.csv("path")
  b <- Read.parquet("path")
  c <- Read.kafka("topic")
  ...
  f <- Join.outer(a, b, Join.on("key" :: Nil))
  g <- Join.left(a, b, Join.on("key" :: Nil))
  k <- Join.inner(a, b, Join.on("key" :: "key2" :: Nil))
  _ <- Join.inner(a, b, Join.`with`('key :: 'value * 10 :: Nil)) // Using column objects
  ...
  n <- Group.agg(f, Seq('key, min('value)), Group.map(
    "value" -> "min", 
    "value" -> "avg",
    "n" -> "collect_set"))
  m <- Group.agg(f, Seq('key), Group.agg(min('value), max('value), collect_set('value)))
  ...
  q <- Filter.where(n, 'value > 250)
} yield ???
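
For comparison, given two DataFrames a and b, rough plain Spark equivalents of the left join, aggregation, and filter above are shown below (standard Spark API; requires import org.apache.spark.sql.functions._).

// Rough plain Spark equivalents of the left join, aggregation and filter above
val joined   = a.join(b, Seq("key"), "left")
val grouped  = joined.groupBy("key")
  .agg(min("value").as("min_value"), avg("value").as("avg_value"))
val filtered = grouped.where(col("min_value") > 250)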

Apply mapping function: DataFrame => DataFrame

If you have a function with the signature DataFrame => DataFrame, you can also apply it with the binding operator as follows.

for {
  ...
  a <- Read.csv("path")
  b <- Read.parquet("path")
  ...
  h <- b >> (_.withColumn("c", lit(true)))
  _ <- b >> (_.withColumn("d", explode('array)))
} yield ???
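
The bound function does not have to be an inline lambda; any named DataFrame => DataFrame value can be bound with >> in the same way. A small sketch follows (addAuditColumn is a hypothetical helper, not part of the library).

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical reusable mapping function, bound with >> like the lambdas above
val addAuditColumn: DataFrame => DataFrame =
  _.withColumn("ingested_at", current_timestamp())

for {
  a <- Read.csv("path")
  b <- a >> addAuditColumn
} yield ???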

WithColumn equivalence

To add a new column inside a for comprehension, do the following:

for {
  ...
  a <- Read.csv("path")
  b <- Read.parquet("path")
  ...
  z <- F.addColumn(a, "new_col", explode('old_col))
  w <- F.addColumn(z, "new_col2", add_months('old_col_m, 4))
} yield ???
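
For comparison, the two F.addColumn calls above are analogous to the plain Spark withColumn calls below (standard Spark API; the symbol syntax assumes spark.implicits._ is in scope).

// Plain Spark withColumn analogues of the F.addColumn calls above
val z = a.withColumn("new_col", explode('old_col))
val w = z.withColumn("new_col2", add_months('old_col_m, 4))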

Samples

Check out the "samples" project for example usage of Thundercats. Build and package it as a JAR, then submit it to your Spark cluster of choice.

Run samples locally

You can try the samples on your local workstation too. Follow the instructions below.

  1. Copy the data files into $HOME/data

mkdir -p $HOME/data
cp samples/src/main/resources/*.csv $HOME/data

  2. Package the JAR

$ sbt samples/assembly

  3. Start a Spark shell with the packaged JAR

$ $SPARK_HOME/bin/spark-shell --jars samples/target/scala-2.12/samples-assembly-0.1.0-SNAPSHOT.jar

  4. Run any of the sample apps, for example:

scala> import com.tao.thundercats.samples.subapp._

scala> DataPipeline.runMe(spark)

Datasets used in samples

All licences and copyrights belong to the original owners of those datasets.


Licence

Apache licence. Redistribution, modification, private use, and sublicensing are permitted.
