rovio-ingest

An implementation of the DatasourceV2 interface of Apache Spark™ for writing Spark Datasets to Apache Druid™.

For example, this allows reading an Apache Hive™ table and ingesting it to Druid in a single Spark job.

Overview
Software requirements
Motivation
Dataset extension
Data type conversions
Segment granularity
Usage
- PySpark
- Scala
- Java
- Write options
Limitations
Development
Building
Troubleshooting

Overview

Overview diagram:

Apache Druid segments are written out to Deep Storage from Apache Spark executors
Apache Spark driver updates the Apache Druid segment metadata table after all Apache Spark executors have completed.
Old overshadowed segments are left in place so that Apache Druid process can properly replace them with newer segments (uses locking).

Software requirements

rovio-ingest requires

Apache Spark 3*
JDK8 (8u92+ or later).
Linux or MacOS.
Git repo must be cloned under a path that doesn't have spaces.

*) Spark 2 is not supported, but feel free to create a Github issue if you would need that.

Motivation

The Druid StorageHandler of Hive is broken in AWS EMR. For example, AWS emr-5.29.0 uses Apache Hive 2.3.4 and it requires additional jar patching to make the Druid StorageHandler work. Even then only CTAS is supported, which means that incremental updates are not possible. Further, it is not possible to create these Apache Hive tables from Apache Spark as they are created with StorageHandlers. Hive 3 might bring some improvement, but the Druid StorageHandler doesn't work AWS emr-6.2.0 either, even though it has Hive 3.

To summarize, these are the features that rovio-ingest addresses, that are lacking with Hive-Druid integration:

Treat __time column as partition dimension
Support for insert overwrite with Apache Hive backed Apache Druid table (for Hive version < 3.0)
Flexibility to use latest Apache Druid libraries.
Update Apache Druid segments in transaction, leveraging Apache Druid's segment locking mechanism.

Dataset extension

The Dataset extension performs the following validations:

Type of time_column is Date or Timestamp
The Dataset has one or more metric columns
The Dataset has one or more dimension columns
The Dataset has no columns with unknown types, unless excludeColumsWithUnknownTypes is set to true

The Dataset extension performs the following transformations:

Drops all columns of complex datatypes such as StructType, MapType or ArrayType as they are not supported by DruidSource. This is only done if excludeColumnsWithUnknownTypes is set to true, otherwise validation has already failed.
Adds a new column __PARTITION_TIME__ whose value is based on time_column column and the given segment granularity (applies NormalizeTimeColumnUDF)
Adds a new column __PARTITION_NUM__ to denote partition shard which is (row_number() - 1) / rows_per_segment for each partition
Partitions the dataset with columns __PARTITION_TIME__ & __PARTITION_NUM__

So it returns a newly partitioned Dataset with columns __PARTITION_TIME__ & __PARTITION_NUM__ added. __PARTITION_TIME__ & __PARTITION_NUM__ columns are always excluded from segment metrics.

Data type conversions

The following type conversions are done on ingestion:

Float is converted to Double
- See Druid Docs / Double Column storage)
Date/Timestamp is converted to String, except for the time_column
- See Druid Docs / Data types

Segment granularity

For example, when the input __time is 2019-01-02:

if granularity = DAY, then time interval = 2019-01-02/2019-01-03
if granularity = MONTH, then time interval = 2019-01-01/2019-02-01
if granularity = YEAR, then time interval = 2019-01-01/2020-01-01

The com.rovio.ingest.DruidSource expects that only one task writes out segments for a given time interval and granularity. This is so that the task correctly shards DataSegments into requisite partitions based on segment size configuration. It is the responsibility of Spark program to correctly partition the dataset based on time interval first, which is handled by the provided Dataset extensions of rovio-ingest.

Usage

Code snippets for different languages are provided below.

The main steps are always the same:

The provided Dataset extension is first used to validate and repartition a Dataset so that it satisfies the requirements of DruidSource
Then the Dataset can be written with the DruidSource format

PySpark

Installing the python module

Install the latest stable release:

pip install rovio-ingest

Alternatively, install the latest pre-release version:

pip install 'rovio-ingest>0.0.0' --pre

If you use pip to install the python module, you need to do it separately on the Spark cluster.

A more isolated alternative is to build a python zip and copy it to s3, then include it in pyFiles. For example:

.conf("spark.submit.pyFiles",
      "s3://my-bucket/my/prefix/rovio_ingest.zip")

Spark configuration

Set the following spark conf:

.conf("spark.jars.packages",
      "com.rovio.ingest:rovio-ingest:1.0.1_spark_3.0.1") \

PySpark job example

from rovio_ingest import DRUID_SOURCE
from rovio_ingest.extensions.dataframe_extension import ConfKeys, add_dataframe_druid_extension

add_dataframe_druid_extension()

# df: DataFrame

df.repartition_by_druid_segment_size(partition_col) \
    .write \
    .mode("overwrite") \
    .format(DRUID_SOURCE) \
    .option(ConfKeys.DATA_SOURCE, "target-datasource-name-in-druid") \
    .option(ConfKeys.TIME_COLUMN, "date") \
    .option(ConfKeys.METADATA_DB_URI, "jdbc:mysql://localhost:3306/druid") \
    .option(ConfKeys.METADATA_DB_USERNAME, "username") \
    .option(ConfKeys.METADATA_DB_PASSWORD, "password") \
    .option(ConfKeys.DEEP_STORAGE_S3_BUCKET, "my-bucket") \
    .option(ConfKeys.DEEP_STORAGE_S3_BASE_KEY, "druid/prod/segments") \
    .save()

This is using imported constants for safety.

For clarity, below is the same code with plain string keys:

from rovio_ingest.extensions.dataframe_extension import add_dataframe_druid_extension

add_dataframe_druid_extension()

# df: DataFrame

df.repartition_by_druid_segment_size(partition_col) \
    .write \
    .mode("overwrite") \
    .format("com.rovio.ingest.DruidSource") \
    .option("druid.datasource", "target-datasource-name-in-druid") \
    .option("druid.time_column", "date") \
    .option("druid.metastore.db.uri", "jdbc:mysql://localhost:3306/druid") \
    .option("druid.metastore.db.username", "username") \
    .option("druid.metastore.db.password", "password") \
    .option("druid.segment_storage.s3.bucket", "my-bucket") \
    .option("druid.segment_storage.s3.basekey", "druid/prod/segments") \
    .save()

Scala

Maven: see Java.

A Dataset[Row] extension is provided to repartition the dataset for the DruidSource Datasource.

For an interactive spark session you can set the following spark conf:

("spark.jars.packages", "com.rovio.ingest:rovio-ingest:1.0.1_spark_3.0.1")

To use a snapshot version:

("spark.jars.repositories", "https://s01.oss.sonatype.org/content/repositories/snapshots"),
("spark.jars.packages", "com.rovio.ingest:rovio-ingest:1.0.1_spark_3.0.1-SNAPSHOT")

import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession}
import com.rovio.ingest.extensions.DruidDatasetExtensions._
import com.rovio.ingest.WriterContext.ConfKeys

val ds: Dataset[Row] = ???

val options = Map[String, String](
  ConfKeys.METADATA_DB_URI -> "jdbc:mysql://localhost:3306/druid",
  ConfKeys.METADATA_DB_USERNAME -> "username",
  ConfKeys.METADATA_DB_PASSWORD -> "password",
  ConfKeys.DEEP_STORAGE_S3_BUCKET-> "my-bucket",
  ConfKeys.DEEP_STORAGE_S3_BASE_KEY -> "druid/prod/segments"
)

ds.repartitionByDruidSegmentSize("date")
  .write
  .mode(SaveMode.Overwrite)
  .options(options)
  .druid("target-datasource-name-in-druid", timeColumn="date")

This is a minimal example. See Write options for more.

Java

Maven (for a full example, see examples/rovio-ingest-maven-example):

<dependencies>
    <dependency>
        <groupId>com.rovio.ingest</groupId>
        <artifactId>rovio-ingest</artifactId>
        <version>1.0.1_spark_3.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.11.1</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

A DruidDataset wrapper class is provided to repartition the dataset for the DruidSource DataSource.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import com.rovio.ingest.extensions.java.DruidDatasetExtensions;
import com.rovio.ingest.DruidSource;
import com.rovio.ingest.WriterContext.ConfKeys;

Dataset<Row> dataset = ...;

Map<String, String> options = new HashMap<>();

options.put(ConfKeys.DATA_SOURCE, "target-datasource-name-in-druid");
options.put(ConfKeys.TIME_COLUMN, "date");
options.put(ConfKeys.METADATA_DB_URI, "jdbc:mysql://localhost:3306/druid");
options.put(ConfKeys.METADATA_DB_USERNAME, "username");
options.put(ConfKeys.METADATA_DB_PASSWORD, "password");
options.put(ConfKeys.DEEP_STORAGE_S3_BUCKET, "my-bucket");
options.put(ConfKeys.DEEP_STORAGE_S3_BASE_KEY, "druid/prod/segments");

DruidDatasetExtensions
    .repartitionByDruidSegmentSize(dataset,"date", "DAY", 5000000, false)
    .write()
    .format(DruidSource.FORMAT)
    .mode(SaveMode.Overwrite)
    .options(options)
    .save();

This is a minimal example. See Write options for more.

Write options

These are the options for DruidSource, to be passed with write.options().

Required properties

Property	Description
`druid.datasource`	Name of the target datasource in Druid
`druid.time_column`	Name of the column in the Spark DataFrame to be translated as Druid `__time` interval. Must be of `TimestampType`.
`druid.metastore.db.uri`	Druid Metadata Storage database URI
`druid.metastore.db.username`	Druid Metadata Storage database username
`druid.metastore.db.password`	Druid Metadata Storage database password
`druid.metrics_spec`	List of metrics aggregation provided as json string, when not provided defaults to using sum aggregator for all numeric columns.

+ Storage type specific properties depending on value of druid.segment_storage.type:

If Deep Storage is s3 (default):

Property Description

druid.segment_storage.s3.bucket S3 bucket name for the Deep Storage

druid.segment_storage.s3.basekey S3 key prefix for the Deep Storage. No trailing slashes.
If Deep Storage is local:

Property Description

druid.segment_storage.local.dir For local Deep Storage, absolute path to segment directory

Property	Description
`druid.segment_storage.s3.bucket`	S3 bucket name for the Deep Storage
`druid.segment_storage.s3.basekey`	S3 key prefix for the Deep Storage. No trailing slashes.

Property	Description
`druid.segment_storage.local.dir`	For local Deep Storage, absolute path to segment directory

Optional properties

Property	Description	Default
`druid.metastore.db.table.base`	Druid Metadata Storage database table prefix	`druid`
`druid.segment_granularity`	Segment Granularity	`DAY`
`druid.query_granularity`	Query granularity	`DAY`
`druid.exclude_dimensions`	Comma separated list of Spark input columns that have to be excluded in Druid ingestion
`druid.segment.max_rows`	Max number of rows per segment	`5000000`
`druid.memory.max_rows`	Max number of rows to keep in memory in spark data writer	`75000`
`druid.segment_storage.type`	Type of Deep Storage to use. Allowed values: `s3`, `local`.	`s3`
`druid.datasource.init`	Boolean flag for (re-)initializing Druid datasource. If `true`, any pre-existing segments for the datasource is marked as unused.	`false`
`druid.bitmap_factory`	Compression format for bitmap indexes. Possible values: `concise`, `roaring`. For type `roaring`, the boolean property compressRunOnSerialization is always set to `true`. `rovio-ingest` uses `concise` by default regardless of Druid library version.	`concise`
`druid.segment.rollup`	Whether to rollup data during ingestion	`true`

Limitations

DruidSource is a write-only DataSource and supports only:

overwrite as Spark write mode
S3 as Druid Deep Storage
- Also local Deep Storage, but it's only useful for testing
MySQL as Druid Metadata Storage

Contributions are welcome to support other write modes or combinations of Deep Storage & Metadata Storage.

Development

The tests use TestContainers and require Docker running on the local machine.

Java / Scala

Project is managed by Maven (pom.xml).

To build the project

mvn compile

To run tests

mvn test

Python

Python code is managed by poetry.

To create the venv with required modules, cd to the python subfolder and run

poetry install

To run tests

poetry run pytest

Building

Building rovio-ingest (jar)

To build the jar package:

mvn package

Shaded uber-jar: target/rovio-ingest-{version}.jar
- Standalone. Includes relocated packages to avoid version conflicts
Plain project jar: target/original-rovio-ingest-{version}.jar
- Requires the maven dependencies (see pom.xml)

The recommended way is to build a shaded jar and use it.

To test the jar in practice, see this notebook as an example.

Building rovio_ingest (python)

To build a zip that is suitable for passing to spark with --py-files or spark.submit.pyFiles:

cd python \
  && zip --exclude='*.pyc' --exclude='*__pycache__*' --exclude='*~' --exclude='.pytest_cache' \
    -FSr ../target/rovio_ingest.zip rovio_ingest ; cd ..

To build a wheel that can be installed with pip – typically before spark session creation (requires poetry installation):

cd python
poetry build

Troubleshooting

Writing may fail with OOM, eg.

# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 9950"...

To fix, decrease druid.memory.max_rows in write options or increase spark executor memory.

_{Apache®, Apache Hive™, Apache Spark™, the Apache Spark project logo, Apache Druid™, and the Apache Druid project logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All other marks mentioned may be trademarks or registered trademarks of their respective owners. No endorsement by or affiliation with the Apache Software Foundation or any third-party trademark owner is claimed or implied by the use of these marks.}

rocel / rovio-ingest