This project adds Amazon DynamoDB support to Spark DataFrames. The paper "Spark SQL: Relational Data Processing in Spark" provides a good introduction to Spark SQL.
As described in section 4.4 of the Spark SQL paper, the `DefaultSource` class implements the `createRelation` function. The `DynamoDBRelation` class extends `BaseRelation` with `TableScan`, which returns an `RDD` of `Row`s.
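As a rough sketch of that pattern (illustrative only, not this library's actual source; the method bodies are deliberately simplified), a minimal relation provider under the Spark 1.4 Data Sources API looks like this:

```scala
// Illustrative sketch of the Spark 1.4 Data Sources API pattern;
// not this library's actual implementation.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class DefaultSource extends RelationProvider {
  // Spark calls this when .format(...).load() is invoked.
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new DynamoDBRelation(parameters)(sqlContext)
}

class DynamoDBRelation(parameters: Map[String, String])
                      (@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan with Serializable {

  // In the real library this is the user-supplied schema or the
  // inferred key schema; a single placeholder column here.
  override def schema: StructType =
    StructType(Seq(StructField("hashKey", StringType)))

  // TableScan: return the whole table as an RDD of Rows.
  // A real implementation would page through a DynamoDB scan.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row]
}
```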
This library requires:

- Spark 1.4.1
- Scala 2.10.4
- AWS Java SDK 1.10.11
This library reads Amazon DynamoDB tables as Spark DataFrames. The API accepts several options:

- `tableName`: name of the DynamoDB table
- `region`: region of the DynamoDB table, e.g. `eu-west-1`
- `scanEntireTable`: defaults to `true`; if set to `false`, only the first page of items is read
- `accessKeyId` / `secretAccessKey`: AWS credentials, passed as in the example below
If a schema is not provided, the API infers the key schema of the DynamoDB table, which includes only the hash and range key attributes. If a schema is provided, the API reads only the attributes named in the schema from the DynamoDB table. For example:
```scala
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val schema = StructType(
  Seq(
    StructField("name", StringType),
    StructField("age", LongType)
  )
)

val options = Map(
  "tableName" -> "users",
  "region" -> "eu-west-1",
  "scanEntireTable" -> "false",
  "accessKeyId" -> "FILL_ME_IN",
  "secretAccessKey" -> "FILL_ME_IN"
)

val df = sqlContext.read
  .format("com.onzo.spark.dynamodb")
  .schema(schema)
  .options(options)
  .load()
```
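If the schema is omitted, only the inferred hash and range key attributes appear as columns. For instance (reusing the `options` map above):

```scala
// No explicit schema: the library infers the table's key schema,
// so only hash and range key attributes appear as columns.
val inferred = sqlContext.read
  .format("com.onzo.spark.dynamodb")
  .options(options)
  .load()

inferred.printSchema()
inferred.show()
```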
This library is built with sbt. To build an assembly jar, run `sbt assembly` from the project root.
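The `assembly` task comes from the sbt-assembly plugin; a typical declaration looks like the following (illustrative only; the version may differ from the one in the project's own `project/plugins.sbt`):

```scala
// project/plugins.sbt -- illustrative; check the repository's own file
// for the exact sbt-assembly version in use.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
```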