Quality
Run complex, high-performance data quality and data processing rules at scale, written in simple SQL, in batch or streaming Spark applications.
Write rules using simple SQL, or create reusable functions via SQL Lambdas. Your rules are just versioned data: store them wherever is convenient and apply them by simply defining a column.
Rules are evaluated lazily during Spark actions, such as writing a row, with results saved in a single predictable and extensible column.
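
A minimal sketch of what this looks like in code, assuming the RuleSuite / ruleRunner API described on the documentation site (the ids, field names and rule expressions below are illustrative only, not part of any real rule set):

import com.sparkutils.quality._

// each rule is just a versioned SQL expression over the row
val rules = RuleSuite(Id(1, 1), Seq(
  RuleSet(Id(50, 1), Seq(
    Rule(Id(100, 1), ExpressionRule("amount > 0")),
    Rule(Id(101, 1), ExpressionRule("length(trim(account)) = 10"))
  ))
))

// df is any existing DataFrame; the rules are added as a single DataQuality column
// and only evaluated when a Spark action, such as the write below, runs.
// registerQualityFunctions() may also be needed if rules use Quality's extra SQL functions - see the docs.
val withDQ = df.withColumn("DataQuality", ruleRunner(rules))
withDQ.write.parquet("/some/output/path")

Because the result is an ordinary column, it can be filtered, joined or written alongside the data like any other field.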
The documentation site https://sparkutils.github.io/quality/ breaks down the reason for Quality's existence and its usage.
What's it written in?
Scala, with sprinklings of Java for WholeStageCodeGen optimisations.
How do I use / build it?
For OSS Spark 3.4.1 use these properties:
<properties>
  <qualityRuntime>3.4.1.oss_</qualityRuntime>
  <scalaCompatVersion>2.12</scalaCompatVersion>
  <sparkShortVersion>3.4</sparkShortVersion>
  <qualityVersion>0.1.2.1</qualityVersion>
  <snakeVersion>1.33</snakeVersion>
</properties>
with dependency:
<dependency>
  <groupId>com.sparkutils</groupId>
  <artifactId>quality_${qualityRuntime}${sparkShortVersion}_${scalaCompatVersion}</artifactId>
  <version>${qualityVersion}</version>
</dependency>
<!-- Only required if expressionRunner, to_yaml or from_yaml are used. Use provided scope when running on Databricks: 1.24 on DBR 12.2 and lower, 1.33 on DBR 13.1 and higher -->
<dependency>
  <groupId>org.yaml</groupId>
  <artifactId>snakeyaml</artifactId>
  <version>${snakeVersion}</version>
  <scope>provided</scope>
</dependency>
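
For sbt builds the same coordinates can be expressed directly in Scala; the following is a sketch only, assuming the property values above (the resolved artifact id follows the quality_<qualityRuntime><sparkShortVersion>_<scalaCompatVersion> pattern, e.g. quality_3.4.1.oss_3.4_2.12):

// build.sbt sketch, mirroring the Maven properties above
val qualityRuntime = "3.4.1.oss_"
val sparkShortVersion = "3.4"
val scalaCompatVersion = "2.12" // must match your project's Scala binary version

libraryDependencies += "com.sparkutils" % s"quality_${qualityRuntime}${sparkShortVersion}_${scalaCompatVersion}" % "0.1.2.1"

// only needed if expressionRunner, to_yaml or from_yaml are used; provided on Databricks
libraryDependencies += "org.yaml" % "snakeyaml" % "1.33" % Provided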
The qualityRuntime property also selects further runtime builds, such as Databricks.
See the build tool dependencies section of the docs site for detailed instructions.