Gerrrr/drunken-data-quality

Drunken Data Quality (DDQ)

Description

DDQ is a small library for checking constraints on Spark data structures. It can be used to verify data quality, especially when data is imported continuously.

Getting DDQ

To use DDQ, add it as a dependency to your project via JitPack.io. Add the following to your build.sbt, replacing x.y.z with the release version you want:

resolvers += "jitpack" at "https://jitpack.io"

libraryDependencies += "com.github.FRosner" % "drunken-data-quality" % "x.y.z"

If you are not using one of the dependency management systems supported by JitPack, you can download a compiled artifact from the releases section. Alternatively, you can build DDQ from source.

Using DDQ

import java.text.SimpleDateFormat
import de.frosner.ddq._

// Load the tables to check (sqlContext is an existing Spark SQLContext)
val customers = sqlContext.table("customers")
val contracts = sqlContext.table("contracts")

// Chain the constraints, then execute all of them with run()
Check(customers)
  .hasNumRowsEqualTo(100000)
  .isNeverNull("customer_id")
  .hasUniqueKey("customer_id")
  .satisfies("customer_age > 0")
  .isConvertibleToDate("customer_birthday", new SimpleDateFormat("yyyy-MM-dd"))
  .hasForeignKey(contracts, "customer_id" -> "contract_owner_id")
  .run()
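
To make the meaning of some of these constraints concrete, here is a minimal sketch of what the checks assert, written against plain Scala collections instead of Spark. It is not DDQ's implementation: the Customer and Contract case classes, the ConstraintSketch object, and its helper signatures are all hypothetical and only borrow the constraint names from the example above. The foreign key helper shows one plausible reading (every key in the checked table appears in the referenced table); DDQ itself may check more than that.

```scala
// Illustrative only: plain Scala collections, no Spark, not DDQ's actual code.
// Customer, Contract and ConstraintSketch are hypothetical names.
final case class Customer(id: Option[Int], age: Int)
final case class Contract(ownerId: Int)

object ConstraintSketch {
  // isNeverNull: every row has a value in the chosen column
  def isNeverNull[A](rows: Seq[A])(column: A => Option[Any]): Boolean =
    rows.forall(row => column(row).isDefined)

  // hasUniqueKey: no key value occurs more than once
  def hasUniqueKey[A, K](rows: Seq[A])(key: A => K): Boolean =
    rows.map(key).distinct.size == rows.size

  // satisfies: the predicate holds for every row
  def satisfies[A](rows: Seq[A])(predicate: A => Boolean): Boolean =
    rows.forall(predicate)

  // hasForeignKey (assumed reading): every key is present among the referenced keys
  def hasForeignKey[K](keys: Seq[K], referencedKeys: Seq[K]): Boolean = {
    val referenced = referencedKeys.toSet
    keys.forall(referenced.contains)
  }

  // Small sample data set used to exercise the helpers
  val customers: Seq[Customer] = Seq(Customer(Some(1), 34), Customer(Some(2), 51))
  val contracts: Seq[Contract] = Seq(Contract(1), Contract(2), Contract(2))
}
```

With the sample data, each helper can be called much like the corresponding DDQ constraint, for example ConstraintSketch.hasUniqueKey(ConstraintSketch.customers)(_.id). In DDQ the same checks run on Spark tables and are executed together by run().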

Authors

Frank Rosner (Creator)
Slavo N. (Contributor)

License

This project is licensed under the Apache License, Version 2.0. For details, see the LICENSE file.
