gadbees / spline-spark-agent

Spline agent for Apache Spark

Home Page: https://absaoss.github.io/spline/

Spark Agent / Harvester

The Spline agent for Apache Spark is a complementary module to the Spline project that captures runtime lineage information from Apache Spark jobs.

The agent is a Scala library that is embedded into the Spark driver. It listens to Spark events and captures logical execution plans. The collected metadata is then handed over to the lineage dispatcher, from where it can either be sent to the Spline server (e.g. via REST API or Kafka) or used in another way, depending on the selected dispatcher type (see Lineage Dispatchers).

The agent can be used with or without a Spline server, depending on your use case. See References.

Spark / Scala version compatibility matrix

             Scala 2.11                        Scala 2.12
Spark 2.2    Yes (no SQL; no codeless init)    -
Spark 2.3    Yes (no Delta support)            -
Spark 2.4    Yes                               Yes
Spark 3.0    -                                 Yes
Spark 3.1    -                                 Yes

Usage

Selecting artifact

There are two main agent artifacts:

  • agent-core is a Java library that can be used with any compatible Spark version. Use this one if you want to include the Spline agent in your custom Spark application and manage all transitive dependencies yourself (see the dependency snippet at the end of this section).

  • spark-spline-agent-bundle is a fat jar that is designed to be embedded into the Spark driver, either by manually copying it to Spark's /jars directory, or by using the --jars or --packages argument of the spark-submit, spark-shell or pyspark commands. This artifact is self-sufficient and is aimed at the majority of users.

Because the bundle is pre-built with all necessary dependencies, it is important to select a proper version of it that matches the minor Spark and Scala versions of your target Spark installation.

spark-A.B-spline-agent-bundle_X.Y.jar

where A.B are the first two Spark version numbers and X.Y are the first two Scala version numbers. For example, if you have Spark 2.4.4 pre-built with Scala 2.12.10, select the following agent bundle:

spark-2.4-spline-agent-bundle_2.12.jar
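
If you opt for the agent-core artifact instead (i.e. you embed the agent into your own Spark application), add it to your build as a regular dependency. The coordinates below are a sketch following the same group as the bundle used later in this README; verify the exact artifact name and version for your Scala version on Maven Central:

za.co.absa.spline.agent.spark:agent-core_2.12:<VERSION>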

Initialization

The Spline agent is basically a Spark query listener that needs to be registered in a Spark session before it can be used. Depending on whether you are using it as a library in your custom Spark application or as a standalone bundle, you can choose one of the following initialization approaches.

Codeless Initialization

This is the most convenient approach and covers the majority of use cases. Simply include the Spline listener in the spark.sql.queryExecutionListeners config property (see Static SQL Configuration).

Example:

pyspark \
  --packages za.co.absa.spline.agent.spark:spark-2.4-spline-agent-bundle_2.12:<VERSION> \
  --conf "spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener" \
  --conf "spark.spline.lineageDispatcher.http.producer.url=http://localhost:9090/producer"

The same approach works for spark-submit and spark-shell commands.

Note: all Spline properties set via Spark conf must carry the spark. prefix in order to be visible to the Spline agent.
See the Configuration section for details.
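
The same settings can also be made permanent by placing them in spark-defaults.conf instead of passing them on the command line (a sketch, assuming the default Spark configuration location and the producer URL used above):

spark.sql.queryExecutionListeners                  za.co.absa.spline.harvester.listener.SplineQueryExecutionListener
spark.spline.lineageDispatcher.http.producer.url   http://localhost:9090/producer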

Programmatic Initialization

Note: starting from Spline 0.6, most agent components can be configured or even replaced in a declarative manner, using either the Configuration or the Plugin API. So normally there should be no need for the programmatic initialization method. We recommend using Codeless Initialization instead.

But if, for some reason, Codeless Initialization doesn't fit your needs, or you want to customize the Spark agent further, you can use the programmatic initialization method.

// given a Spark session ...
val sparkSession: SparkSession = ???

// ... enable data lineage tracking with Spline
import za.co.absa.spline.harvester.SparkLineageInitializer._
sparkSession.enableLineageTracking()

// ... then run some Dataset computations as usual.
// The lineage will be captured and sent to the configured Spline Producer endpoint.

or in Java syntax:

import za.co.absa.spline.harvester.SparkLineageInitializer;
// ...
SparkLineageInitializer.enableLineageTracking(session);

The SparkLineageInitializer.enableLineageTracking() method is overloaded and accepts an optional SplineConfigurer parameter. SplineConfigurer is a factory that creates a Spark execution listener. By default, DefaultSplineConfigurer is used. If you want to override some of the Spline agent's behavior, you can extend DefaultSplineConfigurer or create your own implementation of the SplineConfigurer trait, and pass it into the SparkLineageInitializer.enableLineageTracking() method.
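
For illustration, a minimal sketch of this overload; MyConfigurer is a hypothetical class of yours implementing the SplineConfigurer trait:

// `MyConfigurer` is a hypothetical SplineConfigurer implementation of yours
import za.co.absa.spline.harvester.SparkLineageInitializer._

val myConfigurer = new MyConfigurer( /* your custom settings */ )
sparkSession.enableLineageTracking(myConfigurer)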

Configuration

The agent looks for configuration properties in the following sources (in order of precedence):

  • Hadoop configuration (core-site.xml)
  • Spark configuration
  • JVM system properties
  • spline.properties file on classpath

The file spline.default.properties contains default values for all Spline properties along with additional documentation. It's a good idea to look in the file to see what properties are available.
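
For example, the same property can be supplied from different sources; the sketch below assumes the pre-registered console dispatcher name from spline.default.properties:

# via the Spark configuration (note the spark. prefix)
pyspark --conf "spark.spline.lineageDispatcher=console"

# or via a JVM system property on the driver
pyspark --driver-java-options "-Dspline.lineageDispatcher=console"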

Properties

spline.mode

  • DISABLED Lineage tracking is completely disabled and Spline is unhooked from Spark.
  • REQUIRED If Spline fails to initialize itself (e.g., wrong configuration, no DB connection), the Spark application aborts with an error. (Note: this only concerns the Spline initialization routine. If an error happens during lineage capturing, or in the Spline dispatcher, the target Spark job has already finished by that time and the resulting data has been persisted, regardless of the spline.mode setting. The Spline agent doesn't do any automated rollbacks.)
  • BEST_EFFORT (default) Spline will try to initialize itself, but if it fails it switches to DISABLED mode allowing the Spark application to proceed normally without Lineage tracking.
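
For example, to make a job fail fast on Spline initialization problems, the mode can be set via the Spark configuration (note the spark. prefix):

--conf "spark.spline.mode=REQUIRED"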

spline.lineageDispatcher

The logical name of the root lineage dispatcher. See Lineage Dispatchers chapter.

spline.postProcessingFilter

The logical name of the root post-processing filter. See Post Processing Filters chapter.

Lineage Dispatchers

The LineageDispatcher trait is responsible for sending out the captured lineage information. By default, the HttpLineageDispatcher is used, which sends the lineage data to the Spline REST endpoint (see Spline Producer API).

Available dispatchers:

  • HttpLineageDispatcher - sends the lineage via HTTP
  • KafkaLineageDispatcher - sends the lineage via Kafka
  • ConsoleLineageDispatcher - writes the lineage to the console
  • LoggingLineageDispatcher - logs the lineage using the logger
  • CompositeLineageDispatcher - allows combining multiple dispatchers

Each dispatcher can have different configuration parameters. To keep the configs clearly separated, each dispatcher has its own namespace in which all its parameters are defined. The Kafka dispatcher below serves as an example.

Defining dispatcher

spline.lineageDispatcher=kafka

Once you have defined the dispatcher, all its other parameters use the namespace spline.lineageDispatcher.{{dispatcher-name}}. as a prefix. In this case it is spline.lineageDispatcher.kafka..

To find out which parameters you can use, look into spline.default.properties. For Kafka, at least these two properties have to be defined:

spline.lineageDispatcher.kafka.topic=foo
spline.lineageDispatcher.kafka.producer.bootstrap.servers=localhost:9092
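
Similarly, multiple dispatchers can be combined via the CompositeLineageDispatcher. The following is a sketch; check spline.default.properties for the exact property and pre-registered dispatcher names:

spline.lineageDispatcher=composite
spline.lineageDispatcher.composite.dispatchers=http,console
spline.lineageDispatcher.http.producer.url=http://localhost:9090/producer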

Creating your own dispatcher

You can also create your own dispatcher. It must implement the LineageDispatcher trait and have a constructor with a single parameter of type org.apache.commons.configuration.Configuration. To use it, you must define its name and class, as well as any other parameters you need. For example:

spline.lineageDispatcher=my-dispatcher
spline.lineageDispatcher.my-dispatcher.className=org.example.spline.MyDispatcherImpl
spline.lineageDispatcher.my-dispatcher.prop1=value1
spline.lineageDispatcher.my-dispatcher.prop2=value2

Post Processing Filters

Filters can be used to enrich the lineage with your own custom data or to remove unwanted data like passwords. All filters are applied after the Spark plan is converted to Spline DTOs, but before the dispatcher is called.

The procedure for registering and configuring filters is similar to that of the LineageDispatcher. A custom filter class must implement the za.co.absa.spline.harvester.postprocessing.PostProcessingFilter trait and declare a constructor with a single parameter of type org.apache.commons.configuration.Configuration. Then register and configure it like this:

spline.postProcessingFilter=my-filter
spline.postProcessingFilter.my-filter.className=my.awesome.CustomFilter
spline.postProcessingFilter.my-filter.prop1=value1
spline.postProcessingFilter.my-filter.prop2=value2

Use the pre-registered CompositePostProcessingFilter to chain up multiple filters:

spline.postProcessingFilter=composite
spline.postProcessingFilter.composite.filters=myFilter1,myFilter2

(see spline.default.properties for details and examples)

Spark features coverage

Dataset operations are fully supported.

RDD transformations aren't supported due to Spark internal architecture specifics, but they might be supported semi-automatically in future Spline versions (see #33).

The SQL dialect is mostly supported.

DDL operations are not supported, except for CREATE TABLE ... AS SELECT ..., which is supported.

Note: the lineage is only captured on persistent (write) actions. In-memory-only actions like collect() or printSchema() are ignored.
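
For example, in the following job sketch (paths are hypothetical, and an active SparkSession named spark with the Spline listener registered is assumed), only the final write triggers lineage capture; calling collect() on the filtered Dataset alone would not:

// assumes an active SparkSession `spark` with the Spline listener registered
import spark.implicits._

val people = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/input/people.csv")   // hypothetical input path

people
  .filter($"age" > 18)
  .write
  .mode("overwrite")
  .parquet("/data/output/adults")  // the write action is what gets captured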

The following data formats and providers are supported out of the box:

  • Avro
  • Cassandra
  • COBOL
  • Delta
  • ElasticSearch
  • Excel
  • HDFS
  • Hive
  • JDBC
  • Kafka
  • MongoDB
  • XML

Although Spark, being an extensible piece of software, can support much more, it doesn't provide any universal API that Spline could utilize to capture reads and writes from/to everything that Spark supports. Support for most data sources and formats has to be added to Spline one by one. Fortunately, starting with Spline 0.5.4, the auto-discoverable Plugin API has been introduced to make this process easier.

Below is a breakdown of the read/write commands that we have come across.
Some commands are implemented, others have yet to be implemented, and finally there are some that bear no lineage information and hence are ignored.

All commands inherit from org.apache.spark.sql.catalyst.plans.logical.Command.

You can see how to produce unimplemented commands in za.co.absa.spline.harvester.SparkUnimplementedCommandsSpec.

Implemented

  • CreateDataSourceTableAsSelectCommand (org.apache.spark.sql.execution.command)
  • CreateHiveTableAsSelectCommand (org.apache.spark.sql.hive.execution)
  • CreateTableCommand (org.apache.spark.sql.execution.command)
  • DropTableCommand (org.apache.spark.sql.execution.command)
  • InsertIntoDataSourceDirCommand (org.apache.spark.sql.execution.command)
  • InsertIntoHadoopFsRelationCommand (org.apache.spark.sql.execution.datasources)
  • InsertIntoHiveDirCommand (org.apache.spark.sql.hive.execution)
  • InsertIntoHiveTable (org.apache.spark.sql.hive.execution)
  • SaveIntoDataSourceCommand (org.apache.spark.sql.execution.datasources)

To be implemented

  • AlterTableAddColumnsCommand (org.apache.spark.sql.execution.command)
  • AlterTableChangeColumnCommand (org.apache.spark.sql.execution.command)
  • AlterTableRenameCommand (org.apache.spark.sql.execution.command)
  • AlterTableSetLocationCommand (org.apache.spark.sql.execution.command)
  • CreateDataSourceTableCommand (org.apache.spark.sql.execution.command)
  • CreateDatabaseCommand (org.apache.spark.sql.execution.command)
  • CreateTableLikeCommand (org.apache.spark.sql.execution.command)
  • DropDatabaseCommand (org.apache.spark.sql.execution.command)
  • LoadDataCommand (org.apache.spark.sql.execution.command)
  • TruncateTableCommand (org.apache.spark.sql.execution.command)

When one of these commands occurs, Spline will let you know:

  • When it's running in REQUIRED mode, it will throw an UnsupportedSparkCommandException.
  • When it's running in BEST_EFFORT mode, it will just log a warning.
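
For example, a plain DDL statement like the one below produces a CreateDatabaseCommand from the list above, so in BEST_EFFORT mode a warning would be logged, while in REQUIRED mode the exception would be thrown:

// assumes an active SparkSession `spark`
spark.sql("CREATE DATABASE IF NOT EXISTS my_db")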

Ignored

  • AddFileCommand (org.apache.spark.sql.execution.command)
  • AddJarCommand (org.apache.spark.sql.execution.command)
  • AlterDatabasePropertiesCommand (org.apache.spark.sql.execution.command)
  • AlterTableAddPartitionCommand (org.apache.spark.sql.execution.command)
  • AlterTableDropPartitionCommand (org.apache.spark.sql.execution.command)
  • AlterTableRecoverPartitionsCommand (org.apache.spark.sql.execution.command)
  • AlterTableRenamePartitionCommand (org.apache.spark.sql.execution.command)
  • AlterTableSerDePropertiesCommand (org.apache.spark.sql.execution.command)
  • AlterTableSetPropertiesCommand (org.apache.spark.sql.execution.command)
  • AlterTableUnsetPropertiesCommand (org.apache.spark.sql.execution.command)
  • AlterViewAsCommand (org.apache.spark.sql.execution.command)
  • AnalyzeColumnCommand (org.apache.spark.sql.execution.command)
  • AnalyzePartitionCommand (org.apache.spark.sql.execution.command)
  • AnalyzeTableCommand (org.apache.spark.sql.execution.command)
  • CacheTableCommand (org.apache.spark.sql.execution.command)
  • ClearCacheCommand (org.apache.spark.sql.execution.command)
  • CreateFunctionCommand (org.apache.spark.sql.execution.command)
  • CreateTempViewUsing (org.apache.spark.sql.execution.datasources)
  • CreateViewCommand (org.apache.spark.sql.execution.command)
  • DescribeColumnCommand (org.apache.spark.sql.execution.command)
  • DescribeDatabaseCommand (org.apache.spark.sql.execution.command)
  • DescribeFunctionCommand (org.apache.spark.sql.execution.command)
  • DescribeTableCommand (org.apache.spark.sql.execution.command)
  • DropFunctionCommand (org.apache.spark.sql.execution.command)
  • ExplainCommand (org.apache.spark.sql.execution.command)
  • InsertIntoDataSourceCommand (org.apache.spark.sql.execution.datasources) *
  • ListFilesCommand (org.apache.spark.sql.execution.command)
  • ListJarsCommand (org.apache.spark.sql.execution.command)
  • RefreshResource (org.apache.spark.sql.execution.datasources)
  • RefreshTable (org.apache.spark.sql.execution.datasources)
  • ResetCommand$ (org.apache.spark.sql.execution.command)
  • SetCommand (org.apache.spark.sql.execution.command)
  • SetDatabaseCommand (org.apache.spark.sql.execution.command)
  • ShowColumnsCommand (org.apache.spark.sql.execution.command)
  • ShowCreateTableCommand (org.apache.spark.sql.execution.command)
  • ShowDatabasesCommand (org.apache.spark.sql.execution.command)
  • ShowFunctionsCommand (org.apache.spark.sql.execution.command)
  • ShowPartitionsCommand (org.apache.spark.sql.execution.command)
  • ShowTablePropertiesCommand (org.apache.spark.sql.execution.command)
  • ShowTablesCommand (org.apache.spark.sql.execution.command)
  • StreamingExplainCommand (org.apache.spark.sql.execution.command)
  • UncacheTableCommand (org.apache.spark.sql.execution.command)

Developer documentation

Plugin API

Using the Plugin API, you can capture lineage from a 3rd-party data source provider. Spline discovers plugins automatically by scanning the classpath, so no special steps are required to register or configure a plugin. All you need to do is create a class that extends the za.co.absa.spline.harvester.plugin.Plugin marker trait, mixed in with one or more *Processing traits, depending on your intention.

There are three general processing traits:

  • DataSourceFormatNameResolving - returns the name of a data provider/format in use.
  • ReadNodeProcessing - detects a read-command and gathers meta information.
  • WriteNodeProcessing - detects a write-command and gathers meta information.

There are also two additional traits that handle common cases of reading and writing:

  • BaseRelationProcessing - similar to ReadNodeProcessing, but instead of capturing all logical plan nodes it only reacts to LogicalRelation (see LogicalRelationPlugin)
  • RelationProviderProcessing - similar to WriteNodeProcessing, but it only captures SaveIntoDataSourceCommand (see SaveIntoDataSourceCommandPlugin)

The best way to illustrate how plugins work is to look at a real working example, e.g. za.co.absa.spline.harvester.plugin.embedded.JDBCPlugin.

The most common simplified pattern looks like this:

package my.spline.plugin

import javax.annotation.Priority
import za.co.absa.spline.harvester.builder._
import za.co.absa.spline.harvester.plugin.Plugin._
import za.co.absa.spline.harvester.plugin._

@Priority(Precedence.User) // not required, but can be used to control your plugin precedence in the plugin chain. Default value is `User`.  
class FooBarPlugin
  extends Plugin
    with BaseRelationProcessing
    with RelationProviderProcessing {

  override def baseRelationProcessor: PartialFunction[(BaseRelation, LogicalRelation), ReadNodeInfo] = {
    case (FooBarRelation(a, b, c, d), lr) if /*more conditions*/ =>
      val dataFormat: Option[AnyRef] = ??? // data format being read (will be resolved by the `DataSourceFormatResolver` later)
      val dataSourceURI: String = ??? // a unique URI for the data source
      val params: Map[String, Any] = ??? // additional parameters characterizing the read-command. E.g. (connection protocol, access mode, driver options etc)

      (SourceIdentifier(dataFormat, dataSourceURI), params)
  }

  override def relationProviderProcessor: PartialFunction[(AnyRef, SaveIntoDataSourceCommand), WriteNodeInfo] = {
    case (provider, cmd) if provider == "foobar" || provider.isInstanceOf[FooBarProvider] =>
      val dataFormat: Option[AnyRef] = ??? // data format being written (will be resolved by the `DataSourceFormatResolver` later)
      val dataSourceURI: String = ??? // a unique URI for the data source
      val writeMode: SaveMode = ??? // was it Append or Overwrite?
      val query: LogicalPlan = ??? // the logical plan to get the rest of the lineage from
      val params: Map[String, Any] = ??? // additional parameters characterizing the write-command

      (SourceIdentifier(dataFormat, dataSourceURI), writeMode, query, params)
  }
}

Note: to avoid unintentionally shadowing other plugins (including future ones), make sure that the pattern-matching criteria are as selective as possible for your plugin's needs.

A plugin class is expected to have only a single constructor. The constructor can have no arguments, or one or more parameters of the following types (the values will be autowired):

  • SparkSession
  • PathQualifier
  • PluginRegistry

Compile your plugin and drop it into the Spline/Spark classpath. Spline will pick it up automatically.
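
For instance, when using the bundle with spark-shell, the compiled plugin jar (my-spline-plugin.jar is a hypothetical name) can be added to the driver classpath alongside the bundle, reusing the coordinates and endpoint from the earlier example:

spark-shell \
  --jars my-spline-plugin.jar \
  --packages za.co.absa.spline.agent.spark:spark-2.4-spline-agent-bundle_2.12:<VERSION> \
  --conf "spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener" \
  --conf "spark.spline.lineageDispatcher.http.producer.url=http://localhost:9090/producer"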

Building for different Scala and Spark versions

There are several Maven profiles that make it easy to build the project with different versions of Spark and Scala.

  • Scala profiles: scala-2.11, scala-2.12
  • Spark profiles: spark-2.2, spark-2.3, spark-2.4, spark-3.0, spark-3.1

For example, to build an agent for Spark 2.4 and Scala 2.12:

# Change Scala version in pom.xml.
mvn scala-cross-build:change-version -Pscala-2.12

# now you can build for Scala 2.12
mvn clean install -Pscala-2.12,spark-2.4

References and examples

Although the primary goal of the Spline agent is to be used in combination with the Spline server, it is flexible enough to be used in isolation, or in integration with other data lineage tracking solutions, including custom ones.

Copyright 2019 ABSA Group Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
