byapparov / rdqa

Data Quality Manager for R

The rdqa package simplifies data validation and logs the errors it finds to a database.

Metadata Schema

| Field  | Suggested Type | Description                     |
|--------|----------------|---------------------------------|
| Date   | Timestamp      | Time when error was logged      |
| Source | String         | Data source of the issue        |
| Type   | String         | Type of the issue (enum)        |
| Value  | String         | Value that contains the problem |
| URL    | String         | Link to the source record       |

Error types

| Type      | Description                                |
|-----------|--------------------------------------------|
| Orphan    | Foreign key does not match the primary key |
| Wrong     | Value does not match the business rule     |
| Missing   | Value in the field is empty                |
| Duplicate | Duplicated record                          |
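
To make the four error types concrete, here is a minimal base R sketch of how each one could be detected and turned into records that follow the Metadata Schema. It is an illustration only and does not show how rdqa implements its checks; the `orders` and `customers` tables, their columns, and the "amount must not be negative" business rule are all made up for this example.

```r
# Hypothetical sample data (not part of rdqa)
orders <- data.frame(
  order_id    = c(1, 2, 2, 3, 4),
  customer_id = c(10, 11, 11, 99, 12),
  amount      = c(25, 40, 40, -5, NA)
)
customers <- data.frame(customer_id = c(10, 11, 12))

# Orphan: foreign key has no matching primary key
orphan <- which(!orders$customer_id %in% customers$customer_id)

# Wrong: value breaks the (assumed) business rule "amount must not be negative"
wrong <- which(orders$amount < 0)

# Missing: value in the field is empty
missing <- which(is.na(orders$amount))

# Duplicate: the whole record repeats an earlier one
duplicate <- which(duplicated(orders))

# Assemble the findings using the fields of the Metadata Schema
issues <- data.frame(
  Date   = Sys.time(),
  Source = "orders",
  Type   = c(rep("Orphan", length(orphan)),
             rep("Wrong", length(wrong)),
             rep("Missing", length(missing)),
             rep("Duplicate", length(duplicate))),
  Value  = as.character(c(orders$customer_id[orphan],
                          orders$amount[wrong],
                          orders$amount[missing],
                          orders$order_id[duplicate])),
  URL    = NA_character_,
  stringsAsFactors = FALSE
)
print(issues)
```

Each detected issue becomes one row, so a single record can appear more than once when it breaks several rules.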

Example

Code

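library(rdqa)
library(data.table)  # data.table()
library(DBI)         # dbConnect(), dbDriver(), dbGetQuery()
library(RSQLite)     # provides the SQLite driver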
# Defining the rules - here we don't need the data yet
r.smaller <- newConditionRule("small", condition = "< big")
r.unique <- newUniqueRule("big")
r.name.required <- newRequiredRule("name")

# Combine the rules in Rules Container
all.rules <- newRulesContainer("test.data.td", r.smaller, r.unique, r.name.required)

# Connect to the db where we want to log the results
conn <- dbConnect(dbDriver("SQLite"), "demo.db")

# We only need data just before the validation
dt <- data.table(id = c(1, 2, 3, 4, 5),
                 small = c(1, 10, 2, 3, 4),
                 big = c(10, 1, 20, 20, 40),
                 name = c("a", "", "c", "d", NA_character_),
                 key = "id")

# This call will validate the data against all three rules
# and log results to the database
validateRules(conn, all.rules, dt)

# Check what we have in the errors table
res <- dbGetQuery(conn, "SELECT * FROM errors")

Expected result

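| date       | source       | type      | rule                                       | ref | value | url |
|------------|--------------|-----------|--------------------------------------------|-----|-------|-----|
| 1486681846 | test.data.td | Condition | Field [small] should match condition: < big | 2   | 2     | NA  |
| 1486685164 | test.data.td | Unique    | Field(s) big should be unique              | 4   | 4     | NA  |
| 1486685164 | test.data.td | Required  | Field [name] should not be empty           | 2   | 2     | NA  |
| 1486685164 | test.data.td | Required  | Field [name] should not be empty           | 5   | 5     | NA  |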

Defining rules through Schema object

The Schema object lets you define rules in a more readable layout, with the rules sitting inside the data structure:

# Let's say we want to define rules for customer data. Here is a sample schema:

schema.customers <- Schema(
  "customer.data",
  schema = list(
    list(
      name = "id",
      description = "This is an integer primary key for our customer table",
      class = "integer", 
      required = TRUE,
      unique = TRUE
    ),
    list(
      name = "name",
      class = "character",
      regex = "\\w"
    ),
    list(
      name = "gender",
      class = "character",
      enum = c("male", "female")
    )
  ),
  rules = list(
    newConditionRule("id", "> 0"),
    newConditionRule("name", condition = expression(nchar(name) < 12))
  )
) 

# These are our customers:
customers <- data.table(
  id = c(1L, 2L, NA_integer_, 3L, 4L, -1L),
  name = c("John", "Isabellarose", "Anna", "Bob", NA_character_, ""),
  gender = c("male", "other", "female", "female", "male", "male"),
  key = "id"
)

# Validate rules and log problems
errors <- validate(schema.customers, customers)
print(errors)

The errors table for this example will contain:

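| n  | ref | value        | type      |
|----|-----|--------------|-----------|
| 1: | -1  | -1           | Condition |
| 2: | 2   | Isabellarose | Condition |
| 3: | NA  | NA           | Required  |
| 4: | -1  |              | Regex     |
| 5: | 2   | other        | Enum      |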
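
To see why these five rows are flagged, the same checks can be approximated in base R. This is only a sketch of the logic under the assumption that missing values are skipped by the regex and condition checks; it is not rdqa's implementation, and the variable names are invented for the example.

```r
# Same customer data as above, as a plain data.frame
customers <- data.frame(
  id     = c(1L, 2L, NA_integer_, 3L, 4L, -1L),
  name   = c("John", "Isabellarose", "Anna", "Bob", NA_character_, ""),
  gender = c("male", "other", "female", "female", "male", "male"),
  stringsAsFactors = FALSE
)

# Condition "id > 0": which() drops NA, so only failing rows remain
cond_id <- which(!(customers$id > 0))

# Condition nchar(name) < 12: "Isabellarose" has exactly 12 characters
cond_name <- which(!(nchar(customers$name) < 12))

# Required: id must not be missing
required_id <- which(is.na(customers$id))

# Regex "\\w": the empty string contains no word character
# (assumption: NA names are not reported by the regex check)
regex_name <- which(!is.na(customers$name) & !grepl("\\w", customers$name))

# Enum: gender must be one of the allowed values
enum_gender <- which(!customers$gender %in% c("male", "female"))

# Collect the failing rows in the layout of the errors table
rows <- c(cond_id, cond_name, required_id, regex_name, enum_gender)
checks <- data.frame(
  ref   = customers$id[rows],
  value = c(customers$id[cond_id], customers$name[cond_name],
            customers$id[required_id], customers$name[regex_name],
            customers$gender[enum_gender]),
  type  = c("Condition", "Condition", "Required", "Regex", "Enum"),
  stringsAsFactors = FALSE
)
print(checks)
```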

This specification can be used within an ETL or data import procedure to identify records with errors.

Validation also raises an error if column names or types don't match the schema. You can also stop execution based on the records in the errors output, for example:

library(assertthat)  # provides assert_that()
assert_that(nrow(errors) == 0)
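
If a more descriptive failure message or a tolerance for a few bad records is useful, the same idea can be sketched in base R. `stop_if_errors` and its `max_errors` argument are hypothetical and not part of rdqa; the sketch assumes only that the errors table has a `type` column, as in the example output.

```r
# Hypothetical helper: stop when the errors table has more rows than allowed
stop_if_errors <- function(errors, max_errors = 0L) {
  n <- nrow(errors)
  if (n > max_errors) {
    counts <- table(errors$type)
    details <- paste(sprintf("%s: %d", names(counts), as.integer(counts)),
                     collapse = ", ")
    stop(sprintf("Validation found %d error(s) (%s)", n, details),
         call. = FALSE)
  }
  invisible(errors)
}

# Usage with a made-up errors table; tryCatch keeps the demo from aborting
demo_errors <- data.frame(
  ref  = c(-1, 2),
  type = c("Regex", "Enum"),
  stringsAsFactors = FALSE
)
result <- tryCatch(
  stop_if_errors(demo_errors),
  error = function(e) conditionMessage(e)
)
print(result)
```

Returning the input invisibly lets the helper sit in the middle of a pipeline when validation passes.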

Once schemas are set up for your data processes, it is easy to build a monitoring suite on top of the errors table.
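
As one possible starting point for monitoring, the logged errors can be counted per day, source, and type. The sketch below uses base R on rows copied from the expected result of the first example; in practice the data frame would more likely come from a query against the errors table. The `day` and `n` columns are helpers introduced for this example, and treating `date` as a Unix timestamp in UTC is an assumption.

```r
# Rows copied from the expected result of the first example
errors <- data.frame(
  date   = c(1486681846, 1486685164, 1486685164, 1486685164),
  source = "test.data.td",
  type   = c("Condition", "Unique", "Required", "Required"),
  stringsAsFactors = FALSE
)

# Assumption: `date` holds seconds since 1970-01-01 in UTC
errors$day <- format(
  as.POSIXct(errors$date, origin = "1970-01-01", tz = "UTC"),
  "%Y-%m-%d"
)

# One row per error, then sum within each day / source / type group
errors$n <- 1L
daily <- aggregate(n ~ day + source + type, data = errors, FUN = sum)
print(daily)
```

A table like `daily` can be plotted or compared against a threshold to flag sources whose error counts are rising.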
