Gerrrr / drunken-data-quality

Some utility classes for checking data quality in Spark

Drunken Data Quality (DDQ)

Description

DDQ is a small library for checking constraints on Spark data structures. It helps ensure a defined level of data quality, especially when data is imported continuously.

Getting DDQ

To use DDQ, add it as a dependency to your project through JitPack.io. Add the following to your build.sbt, replacing x.y.z with the release version you want:

resolvers += "jitpack" at "https://jitpack.io"

libraryDependencies += "com.github.FRosner" % "drunken-data-quality" % "x.y.z"

If you are not using one of the dependency management systems supported by JitPack, you can download a compiled artifact from the release section. Alternatively, you can build from source.

Using DDQ

import java.text.SimpleDateFormat

import de.frosner.ddq._

// Load the tables to check (assumes an existing sqlContext, e.g. in spark-shell)
val customers = sqlContext.table("customers")
val contracts = sqlContext.table("contracts")

// Declare the constraints on the customers table, then evaluate them with run()
Check(customers)
  .hasNumRowsEqualTo(100000)
  .isNeverNull("customer_id")
  .hasUniqueKey("customer_id")
  .satisfies("customer_age > 0")
  .isConvertibleToDate("customer_birthday", new SimpleDateFormat("yyyy-MM-dd"))
  .hasForeignKey(contracts, "customer_id" -> "contract_owner_id")
  .run()
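
To make the intent of some of these constraints concrete, here is a minimal sketch of what they plausibly assert, written against plain Scala collections instead of Spark. It is not DDQ's implementation and does not use the DDQ API: the Customer case class, the ConstraintSketch object, and the function signatures are all hypothetical, and the semantics are inferred from the constraint names only.

```scala
import java.text.SimpleDateFormat
import scala.util.Try

// Hypothetical row type standing in for the "customers" table; not a DDQ type.
final case class Customer(id: Option[Long], age: Int, birthday: String)

// Assumed semantics, inferred from the constraint names; not DDQ code.
object ConstraintSketch {
  // The table has exactly the expected number of rows.
  def hasNumRowsEqualTo[A](rows: Seq[A], expected: Long): Boolean =
    rows.size.toLong == expected

  // Every row has a value in the given column.
  def isNeverNull[A, B](rows: Seq[A])(column: A => Option[B]): Boolean =
    rows.forall(row => column(row).isDefined)

  // No key value occurs more than once.
  def hasUniqueKey[A, K](rows: Seq[A])(key: A => K): Boolean =
    rows.map(key).distinct.size == rows.size

  // Every row passes the predicate.
  def satisfies[A](rows: Seq[A])(predicate: A => Boolean): Boolean =
    rows.forall(predicate)

  // Every value in the column can be parsed with the given date format.
  def isConvertibleToDate[A](rows: Seq[A], format: SimpleDateFormat)(column: A => String): Boolean =
    rows.forall(row => Try(format.parse(column(row))).isSuccess)

  def main(args: Array[String]): Unit = {
    val customers = Seq(
      Customer(Some(1L), 34, "1990-01-15"),
      Customer(Some(2L), 52, "1972-07-03")
    )
    val results = Seq(
      "hasNumRowsEqualTo(2)" -> hasNumRowsEqualTo(customers, 2),
      "isNeverNull(id)" -> isNeverNull(customers)(_.id),
      "hasUniqueKey(id)" -> hasUniqueKey(customers)(_.id),
      "satisfies(age > 0)" -> satisfies(customers)(_.age > 0),
      "isConvertibleToDate(birthday)" ->
        isConvertibleToDate(customers, new SimpleDateFormat("yyyy-MM-dd"))(_.birthday)
    )
    results.foreach { case (name, ok) =>
      println(s"$name: ${if (ok) "passed" else "failed"}")
    }
  }
}
```

The foreign key constraint is left out of the sketch because its exact semantics cannot be read off the name alone.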

Authors

License

This project is licensed under the Apache License Version 2.0. For details please see the file called LICENSE.

