
dwm (Data Washing Machine)

Red Hat's business logic for maintaining marketing data quality

Introduction

Database quality is a problem for many companies. Often there is a mad dash to collect as much data as possible before a single thought is given to keeping that data high quality. Some examples of bad input data include:

  • Data collection tools (such as interest or freemium forms) that require manual input, rather than OAuth or picklists
  • Ill-trained data entry or office staff
  • Data purchased from outside sources that does not conform with company standards

Bad data introduced through these sources can cost significant time in manual correction, or lead directly to lost opportunities and revenue when you can't query on clean data. Obviously, one solution is to make sure that all data collection sources conform with your database standards, but if you can make that happen, then I'll just be over here flying on my unicorn.

This package was originally developed for use by Red Hat's Marketing Operations group to maintain quality of contact data, although the principles are sound enough to apply to many types of databases.

Business logic

The following are what we have determined, as a best practice, to be the general rules available for cleaning a set of data. Theoretically, any string field can have these rules applied to it; however, when configuring DWM, one should evaluate whether a rule is appropriate for a given field.

Validation

Validation is the removal of data that is straight-up junk and provides no business value whatsoever. This data is usually the result of spam-bots, errors in collection tools (such as posting bad html strings), or someone uploading the wrong spreadsheet. We've split validation into two pieces: generic and field-specific.

Generic

Generic validation is the removal of data that has no place in a given database, no matter what field it is found in. The following are examples of generic bad data:

aaaaaaaaaaaaaaa ## one repeated character
fdsafdasfdfdsafdafdsa ## all typed with the left hand on the home row
buy levitra cialis ## spam data
www.buymystuff.com ## marketing database-specific example; our processes do not try to clean any website/URL related fields, so if this appears in one of the fields we are cleaning, it's probably bad data

DWM uses two types of generic validation:

  • genericLookup: remove 'bad' values based on a known list of bad data (previously observed)
  • genericRegex: remove 'bad' values based on regular expressions; e.g., any word longer than four characters consisting of a single repeated character, or any string containing 'viagra'
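
For illustration, the repeated-character and spam checks described above could look something like this standalone sketch (the patterns and function names are hypothetical, not the package's internal code):

import re

GENERIC_JUNK = [
    re.compile(r'(.)\1{4,}'),                      # a run of 5+ of the same character
    re.compile(r'viagra|levitra', re.IGNORECASE),  # known spam keywords
]

def is_generic_junk(value):
    """Return True if the value matches any known junk pattern."""
    return any(pattern.search(value) for pattern in GENERIC_JUNK)

print(is_generic_junk('aaaaaaaaaaaaaaa'))  # True
print(is_generic_junk('Jane'))             # False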

Field-Specific

Field-specific validation is the removal of data that is junk in one field but good data in another. An example is the ten-digit string 9493020093. In a phone number field, this is probably good data; in the First Name field, it's junk. Conversely, 'hi, this is a string of letters' may have a purpose in a text-based field, but it provides no reasonable data in a phone number field.

DWM uses two types of field-specific validation:

  • fieldSpecificLookup: remove 'bad' values based on a known list of bad data (previously observed)
  • fieldSpecificRegex: remove 'bad' values based on regular expressions; e.g., for the field firstName, remove any value consisting entirely of numbers
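
A field-specific check can be pictured as a per-field table of patterns; a minimal sketch, with hypothetical field and function names:

import re

FIELD_JUNK = {
    'firstName': re.compile(r'^[0-9]+$'),        # all digits is junk in a name field
    'phoneNumber': re.compile(r'^[A-Za-z ]+$'),  # all letters is junk in a phone field
}

def is_field_junk(field, value):
    pattern = FIELD_JUNK.get(field)
    return bool(pattern and pattern.match(value))

print(is_field_junk('firstName', '9493020093'))    # True
print(is_field_junk('phoneNumber', '9493020093'))  # False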

Normalization

Normalization is the correction of data to conform with an expected set of values. For example, Proggrrammer/Developer is almost the valid "Job Role" value Programmer/Developer, but is mis-spelled. Another example is Programmer, which clearly belongs in the same category but is not an exact value match.

Note that normalization usually cannot be applied to fields that are expected to be free-text, such as "First Name" or "Company Name". If certain rules need to be applied to those fields, use of the User-Defined Functions is recommended.

DWM uses three types of normalization:

  • normLookup: replace 'almost' values based on a known list of data (previously observed); e.g., common mis-spellings
  • normRegex: replace 'almost' values based on regular expressions; e.g., for the field jobRole, replace any value that contains programmer but not manager with Programmer/Developer
  • normIncludes: replace 'almost' values based on at least one of the following: includes strings, excludes strings, starts with string, ends with string
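
For illustration, a normLookup table plus a normIncludes-style rule for the jobRole examples above might look like this sketch (hypothetical names, not the package's code):

NORM_LOOKUP = {
    'PROGGRRAMMER/DEVELOPER': 'Programmer/Developer',  # a previously observed mis-spelling
}

def normalize_job_role(value):
    # normLookup-style: exact match against a table of known 'almost' values
    match = NORM_LOOKUP.get(value.strip().upper())
    if match:
        return match
    # normIncludes-style: contains 'programmer' but not 'manager'
    lowered = value.lower()
    if 'programmer' in lowered and 'manager' not in lowered:
        return 'Programmer/Developer'
    return value

print(normalize_job_role('Proggrrammer/Developer'))  # 'Programmer/Developer'
print(normalize_job_role('Senior Programmer'))       # 'Programmer/Developer'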

Derivation

Derivation is the management of fields that are not explicitly collected but are designed to help business users make decisions more easily. One example is a "Super Region" field: although "Country" is likely collected, users of the database may only need to filter to a general region to do their jobs.

DWM uses four types of derivation:

  • deriveValue: given input values from one or more fields, find the corresponding output value; e.g., for jobRole='Manager' and department='IT', set persona='IT Decision Maker'
  • copyValue: given an input value from one field, copy that value to the target field
  • deriveRegex: given an input value from one field, derive target field value using regular expressions
  • deriveIncludes: given an input value from one field, derive target field based on at least one of the following: includes strings, excludes strings, starts with string, ends with string

Within the runtime configuration, derivation rules are ordered within a dictionary to maintain a rule hierarchy: if Rule 1 does not yield a result, then Rule 2 is tried, and so on. The process exits after one of the derivation rules produces a new value, as sketched below.
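
The rule hierarchy can be pictured like this (a standalone sketch of the idea with hypothetical rules, not the package's code):

def derive_persona(record):
    # rules are tried in order; the first one that yields a value wins
    rules = [
        lambda r: 'IT Decision Maker'
            if r.get('jobRole') == 'Manager' and r.get('department') == 'IT' else None,
        lambda r: 'Developer' if r.get('jobRole') == 'Programmer/Developer' else None,
    ]
    for rule in rules:
        result = rule(record)
        if result is not None:
            return result  # exit after the first rule that produces a value
    return record.get('persona', '')

print(derive_persona({'jobRole': 'Manager', 'department': 'IT'}))  # 'IT Decision Maker'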

User-Defined Functions

The above three processes are the most common that need to be applied to data, but not everything can be planned for. User-defined functionality is designed to fill this gap. For instance, US zip codes can have some fairly basic and consistent transformations applied to make the data easier to work with (e.g., strip off trailing hyphen/number combos, left-pad with 0s in case of bad spreadsheet formatting), but those rules don't fall into any of the above categories.

Also included may be third-party data enrichment. For example, if you have an API contract with a company that provides IP address geolocation, or provides additional company info based on email domain, you can define a function to interact with that API and pull additional data into the fields of interest.

Order

We've found this to be the most efficient order in which to run the above cleaning types.

  1. Generic Validation
  2. Field-specific Validation
  3. Normalization
  4. Derive Data (aka "Fill-in-the-gaps", depending on field type)

Audit History

Record-level audit history is a record of what changes were made to which data fields. This includes what the previous value was, what the new/replacement value was, and what rule caused the change. The record is somewhat akin to a git commit, in that it only records where changes were made, and does not keep a record of anything that remained unchanged. Although it is optional in this package, it is recommended for any automation of these processes to provide both a record for troubleshooting and transparency for the business users of the database.
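
For illustration, a history record for a single normalization change might have a shape like the following (field names here are hypothetical; see the DataDictionary for the actual schema):

history_record = {
    'configName': 'myConfig',
    'jobRole': {
        'lookupType': 'normLookup',
        'fromVal': 'Proggrrammer/Developer',
        'toVal': 'Programmer/Developer',
    },
    # unchanged fields do not appear, akin to a git commit
}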

Architecture

Data Flow


  1. Gather data for cleaning in a Python script that uses the DWM package (e.g., use an API to export contact data from a Marketing Automation Platform)
  2. Import custom functions (if applicable)
  3. Connect to MongoDB using pymongo's MongoClient
  4. Pass the data (as a list of dictionaries), along with a configName and the MongoDB connection, to the dwmAll function
  5. Take post-processing action (e.g., use an API to import the cleaned data back into a Marketing Automation Platform); a sketch of this flow follows below
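
Put together, the calling script follows roughly this shape (a sketch only; export_contacts and import_contacts are hypothetical stand-ins for your own integration code, and the full set of dwmAll parameters is shown in the example near the end of this README):

from pymongo import MongoClient
from dwm import dwmAll

from myplatform import export_contacts, import_contacts  # hypothetical integration helpers

db = MongoClient('mongodb://localhost:27017/')['dwm']

data = export_contacts()                                        # step 1: a list of dictionaries
cleaned = dwmAll(data=data, mongoDb=db, configName='myConfig')  # step 4: clean against a stored config
import_contacts(cleaned)                                        # step 5: push the cleaned data back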

dwmAll

This function is the highest-level wrapper for all DWM functions.


  1. Use configName to retrieve the config document from MongoDB
  2. Apply sorting to the relevant parts of the config (derive and userDefinedFunctions), as sketched below
  • This is necessary because a Python OrderedDict cannot be stored in MongoDB, and the order in which some rules are applied is important
  3. Loop through the data, passing each record to dwmOne along with the config and MongoDB collection
  4. If configured to write history and return the history ID, append the _id to each record
  5. Return a list of dictionaries containing the cleaned data
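
The sorting in step 2 can be pictured like this (a sketch of the idea, not the package's code):

from collections import OrderedDict

# as stored in MongoDB, derive rules are a plain document keyed by '1', '2', ...
derive_config = {'2': {'type': 'copyValue'}, '1': {'type': 'deriveValue'}}

# rebuild the ordering after retrieval, since MongoDB cannot store an OrderedDict
ordered = OrderedDict(sorted(derive_config.items(), key=lambda item: int(item[0])))
print(list(ordered))  # ['1', '2']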

dwmOne

This function applies wrapper functions to each data record. It follows the specification above, in Business Logic: Order.


  1. Create a history collector {}
  2. Run userDefinedFunctions=beforeGenericValidation
  3. Run lookupAll with lookupType='genericLookup'
  4. Run userDefinedFunctions=beforeGenericRegex
  5. Run lookupAll with lookupType='genericRegex'
  6. Run userDefinedFunctions=beforeFieldSpecificValidation
  7. Run lookupAll with lookupType='fieldSpecificLookup'
  8. Run userDefinedFunctions=beforeFieldSpecificRegex
  9. Run lookupAll with lookupType='fieldSpecificRegex'
  10. Run userDefinedFunctions=beforeNormalization
  11. Run lookupAll with lookupType='normLookup'
  12. Run userDefinedFunctions=beforeNormalizationRegex
  13. Run lookupAll with lookupType='normRegex'
  14. Run userDefinedFunctions=beforeNormalizationIncludes
  15. Run lookupAll with lookupType='normIncludes'
  16. Run userDefinedFunctions=beforeDeriveData
  17. Run DeriveDataLookupAll
  18. Run userDefinedFunctions=afterProcessing
  19. If writeContactHistory==True, write the history collector to the contactHistory collection in MongoDB
  20. Return data record and history ID (if applicable, None otherwise)

Wrapper functions

These functions are responsible for applying all the specified cleaning functions to every field in the input record, based on the given config.

lookupAll

This function applies a single cleaning function and lookup type to every field in the input record, based on the given config. Since the lookup functions it calls operate on the current value of a field, it skips fields with blank values for performance. A simplified sketch follows the list below.

  1. Loop through each field in the record
  2. If the field value is not blank and the field name is in the config, then proceed
  3. If the config value for the field contains the current lookupType, then pass to the appropriate function:
  • 'genericLookup', 'fieldSpecificLookup', 'normLookup': DataLookup
  • 'genericRegex', 'fieldSpecificRegex', 'normRegex': RegexLookup
  • 'normIncludes': IncludesLookup with lookupType='normIncludes'
  4. The functions in step 3 return the new field value (potentially the same as the original, if no match was found) and an updated history object
  5. Set the field value in the data record to the return value from step 4
  6. Return the data record and history object
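
In outline, the loop behaves like this (a simplified sketch; clean_fn stands in for DataLookup, RegexLookup, or IncludesLookup, and the config shape is abbreviated):

def lookup_all(record, lookup_type, config, clean_fn, hist_obj):
    for field, value in record.items():
        if value == '' or field not in config['fields']:
            continue  # skip blank values (performance) and unconfigured fields
        if lookup_type in config['fields'][field]['lookup']:
            new_value, hist_obj = clean_fn(value, lookup_type, hist_obj)
            record[field] = new_value
    return record, hist_obj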

DeriveDataLookupAll

This function applies all defined derive rules to every field in the input record, based on the given config.

  1. Loop through each field in the record
  2. If the field name is in the config, then proceed
  3. Loop through the derive configs for the current field (an OrderedDict, sorted by dwmAll)
  4. If all the fields specified by a derive rule are present in the record, then proceed
  5. Apply the following based on the type specified in the config:
  • DeriveDataLookup
  • DeriveDataCopyValue
  • DeriveDataRegex
  • IncludesLookup with lookupType='deriveIncludes'
  6. The functions in step 5 return the new field value (potentially the same as the original, if no match was found) and an updated history object
  7. If the field value has changed, then update the field in the record and stop the loop
  8. Return the data record and history object

Cleaning functions

These functions are responsible for determining what the new value of a field should be, in most cases based on a lookup against MongoDB.

DataLookup

Lookup the replacement value given a single input value from the same field.

IncludesLookup

Query all applicable "includes" definitions (by field name) from MongoDB and look for a match.

RegexLookup

Query all applicable regex (generic, or match on field name) from MongoDB and look for a match.

DeriveDataLookup

Lookup replacement value given one or more input values from different fields.

DeriveDataCopyValue

Copy a value from one field to another.

DeriveDataRegex

Query all applicable regex (match on field name) from MongoDB and look for a match.

Helpers

These are miscellaneous functions used throughout the package that don't fit into one general category.

_CollectHistory_

Creates a basic dictionary of what change, if any, was applied to a record field.

_CollectHistoryAgg_

Updates an existing history dictionary with the result of _CollectHistory_, if applicable.

_DataClean_

Applies cleaning rules to lookup values before querying MongoDB, so that small differences don't prevent a match; a minimal sketch follows the list below.

  • Convert to uppercase
  • Strip leading and trailing whitespace
  • Remove line breaks, carriage returns, and non-visible characters
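
A minimal sketch of that kind of pre-query cleaning (illustrative only, not the package's exact implementation):

import re

def data_clean(value):
    value = value.upper().strip()                           # uppercase, trim whitespace
    value = re.sub(r'[\r\n]+', ' ', value)                  # remove line breaks / carriage returns
    return ''.join(ch for ch in value if ch.isprintable())  # drop non-visible characters

print(data_clean('  programmer\n'))  # 'PROGRAMMER'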

_RunUserDefinedFunctions_

Passes data and history into functions defined by configuration.

Setup Process

Hosting

  • Local machine: it's entirely possible to run this complete process on an individual laptop/desktop, although this is not recommended due to backup and business-continuity risks.
  • PaaS: Platform-as-a-Service is the recommended route to get up and running quickly, since developers don't have to worry about the engineering work of keeping their services running. We're using Red Hat's OpenShift, but other options (such as Heroku) are available. Be warned that the PaaS may have to be internally hosted at your workplace to ensure connectivity to internal databases. You should also be aware of potential security concerns around PII, especially if you're storing record history, and may need to work with your IT team to ensure secure storage and transit for such data.

Python

Python 2.7 is the recommended minimum, although a 3.x release is advisable if Unicode support is required. This package is tested against Python 2.7, 3.3, 3.4, and 3.5.

MongoDB

MongoDB is required for persistent storage of runtime configurations, lookup tables, regex rules, and derivation rules. It also serves as an optional (but recommended) home for record-level audit history. Since operational data is stored here, you should have a routine backup process in place. An exact description of the schema is included in the DataDictionary.

This package was designed against MongoDB 3.2.x, but due to multi-key indexing requirements, at least version 2.5.5 is advised.

Configuration

Runtime configuration for DWM is stored in a JSON document within MongoDB. It is retrieved by the unique "configName" field when the dwmAll function is called, and dictates which fields are cleaned, what types of lookups, regexes and derivation rules are called, and which user-defined functions should be called. Multiple configurations can be stored and called for different purposes. For example, a configuration for use directly against a database may include rules for 20 fields, while one running within an API may only run against five fields.

A full example is given in the DataDictionary.md file.

Required fields (an abbreviated example document follows this list):

  • configName: Must be a unique string
  • fields: Includes a document for each field to be cleaned; each should include the following:
  • lookup: an array of which lookup and regex rules should be applied: genericLookup, genericRegex, fieldSpecificLookup, fieldSpecificRegex, normLookup, normRegex, normIncludes
  • derive: a document of documents, each named in order of execution (1,2,...) and containing the following sub-fields:
    • type: string indicating what type of derivation should be applied: deriveValue, copyValue, deriveRegex, deriveIncludes
    • fieldSet: array of field names to be used in the derive process. Must contain exactly one field name if type is copyValue, deriveRegex, or deriveIncludes.
    • overwrite: boolean indicating whether to write over an existing value
    • blankIfNoMatch: overwrite existing value with a blank value if no match found
  • userDefinedFunctions: document of the following sub-documents with ordered numeric names, indicating when user-defined functions should be run: beforeGenericValidation, beforeGenericRegex, beforeFieldSpecificValidation, beforeFieldSpecificRegex, beforeNormalization, beforeNormalizationRegex, beforeNormalizationIncludes, beforeDeriveData, afterProcessing
  • history: settings dictating if/how to write contact history
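
An abbreviated example of such a config document, written here as a Python dict (the exact schema, including the history sub-fields, is defined in DataDictionary.md; this is only an illustration):

config = {
    'configName': 'myConfig',
    'fields': {
        'jobRole': {
            'lookup': ['genericLookup', 'genericRegex', 'normLookup', 'normRegex'],
            'derive': {}
        },
        'persona': {
            'lookup': [],
            'derive': {
                '1': {  # tried first: deriveValue from jobRole + department
                    'type': 'deriveValue',
                    'fieldSet': ['jobRole', 'department'],
                    'overwrite': True,
                    'blankIfNoMatch': False
                },
                '2': {  # fallback: copy jobRole into persona
                    'type': 'copyValue',
                    'fieldSet': ['jobRole'],
                    'overwrite': False,
                    'blankIfNoMatch': False
                }
            }
        }
    },
    'userDefinedFunctions': {
        'beforeGenericValidation': {'1': 'myFunction'}
    },
    'history': {'writeContactHistory': True, 'returnHistoryId': True}
}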

Lookups, Derivation, and Regex rules

A complete schema for these items is in the DataDictionary.md file. Also included is a recommendation for indexes to improve performance.
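
For example, with pymongo an index supporting lookups might be created like this (the collection and field names here are hypothetical; use the names recommended in DataDictionary.md):

from pymongo import MongoClient, ASCENDING

db = MongoClient('mongodb://localhost:27017/')['dwm']

# compound index over the fields a lookup queries on
db.lookup.create_index([('type', ASCENDING), ('fieldName', ASCENDING), ('find', ASCENDING)])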

User-Defined Functions

User-Defined functions must take exactly two inputs, data (a single dictionary of data to which transformations are applied) and histObj (a dictionary object used to record field-level changes), and output the same two (with changes applied to data and any relevant updates made to histObj). Helper functions for recording history are included in the dwm package.

UDFs should ideally be defined in a file separate from the script calling the DWM functions, then imported independently. If using UDFs, the dwmAll parameter udfNamespace must be set to __name__.

Example:

udf.py

from dwm import _CollectHistory_, _CollectHistoryAgg_

def myFunction(data, histObj):

    fieldOld = data['myField']

    fieldNew = 'Hi! This is a data change'

    data['myField'] = fieldNew

    change = _CollectHistory_(lookupType='UDF-myFunction', fromVal=fieldOld, toVal=fieldNew) ## recommended format for lookupType: "UDF-nameOfFunction"

    histObjUpd = _CollectHistoryAgg_(contactHist=histObj, fieldHistObj=change, fieldName='myField')

    return data, histObjUpd

example.py

from dwm import dwmAll
from udf import myFunction ## imported so the UDF is available in this namespace

### get data to run through, define MongoDB collections and connection, etc.

dataOut = dwmAll(data=data, mongoDb=db, mongoConfig=mongoConfig, configName='myConfig', returnHistoryId=False, udfNamespace=__name__)

Application

dwm is just a Python package that applies business logic; it still requires scripting and configuration to actually apply it to data. Here is our implementation using OpenShift:

https://github.com/rh-marketingops/dwmops

License

GNU General Public License v3.0