daqual

A data quality framework for python

Antiscope

Daqual is a framework only for measuring and assessing data quality in a data-handling system. It does not involve itself in the complexities of what to do with data files or objects once a quality measure has been determined (e.g. alert, re-run, fix-data, etc).

High level model and key concepts

Daqual is a framework that encourages reuse and composability, made up of functions that each perform a trivial quality measure/assessment. The power of the framework emerges from the composability and reuse of these simple functions.

Each quality check/function returns a score (of between 0 and 1) for quality; 1 being perfect, 0 being failure. The semantic intent of the quality score may vary between functions, in some cases representing a "percentage", say, of an expected value, or with other predetermined values in the interval (0,1) representing particular levels or measures of quality.
Although a number of quality scoring functions are provided, it is intended that custom functions can be easily written and used in the same framework.
Quality scoring functions can call other functions, aggregating or combining the scores derived from called subordinate functions.
Quality functions operate upon objects (typically file objects or objects retrieved from an object store, e.g. AWS S3).
Multiple objects and quality functions can be assessed in aggregate and as an atomic unit, returning both an overall quality score, individual scores for particular tests, as well as individual or aggregate "pass or fail" measures, defined by a threshold for each test or for the entire set.
Aggregate scoring can also take advantage of "weighting" of individual tests, to ensure that some tests are treated as more or less significant than others.
Once an object has been assessed, it can be tagged with a record of it's quality score.
Scoring functions can take other objects as parameters to allow quality measures to be determined with reference to other data sources, e.g. to ensure mutual consistency between objects.
Different object-stores are implemented by use of the concept of provider. Providers for AWS S3 and local filesystem are implemented by default, but other providers can be trivially implemented and incorporated.

Table of Implemented/Proposed Quality Checks

Here are a list of data quality functions implemented in Daqual core, that are needed to implement a basic framework which can then be combined to provide us with the overall functionality we desire.

Description	Function Name	Parameters	Notes
Expected number of columns	score_column_count	column, expected_n
Expected column names	score_column_names	columns	returns either 0 or 1
Are blanks allowed in a column?	score_no_blanks	column	returns either 0 or 1
Is a column constrained to a fixed set of values?	score_column_valid_values	column, master, master_column	returns % of values that are present in master list
Is every single value from a master table used?	score_every_master_value_used	column, master, master_column	returns % of values that are used from a master list
Are all the values in a column unique?	score_unique_column	column	returns either 0 or 1
Column matches a regex	score_match	match, column
Column is an integer	score_int	column	returns either 0 or 1; ignores NaN
Column is a date
Column is a float	score_float	column	returns either 0 or 1; ignores NaN
Column is a number	score_number	column	returns either 0 or 1; ignores NaN
Expected sum of a column (including group-by)
Expected mean of a column (including group-by)
Row count	expected_row_count	expected_row_count or comparison, expected_delta
Max of a column (including group-by)
Min of a column (including group-by)
Compare to another version of a file
Are all values greater than?
Are all values less than?
Number of different values in a column

Note: parameters are supplied in a dict

See source code for detailed explanation of how to use each scoring function

Definitions/explanations of common parameters:

column
master
master_column
comparison
expected
expected_row_count
expected_n
match

Example Testset

A test set is defined as an array, with each element being structured as follows: [object_name, scoring_function, function parameter object/dict, weight, threshold]

Example:

test_set=[
    ['daqual/ExampleCSV.csv',daqual.score_column_count,{"expected_n":8},1,1],
    ['daqual/ExampleCSV.csv',daqual.score_row_count,{"expected_row_count":5},1,1],
    ['daqual/ExampleCSV.csv',daqual.score_column_names,{'columns':['Column 1', 'Second Column', 'Column 3', 'Optional Column 1',
                                            'Optional Column2', 'GroupByColumn','Integer Column', 'Float Column']},1,1],
    ['daqual/ExampleCSV.csv',daqual.score_no_blanks,{'column':'Second Column'},1,1],
    ['daqual/ExampleCSV.csv',daqual.score_float,{'column':'Float Column'},1,1],
    ['daqual/groups.csv',daqual.score_column_count,{"expected_n":1},1,1],
    ['daqual/ExampleCSV.csv',daqual.score_column_valid_values,{'column':'GroupByColumn', 'master':'daqual/groups.csv',
                                                    'master_column':'groups'},1,1],
    ['daqual/ExampleCSV.csv', daqual.score_unique_column, {'column': 'GroupByColumn'}, 1, 1],
    ['daqual/groups.csv',daqual.score_unique_column,{'column':'groups'},1,1],
    ['daqual/groups.csv',daqual.score_row_count,{'expected_row_count':3},1,1],
    ['daqual/groups.csv',daqual.score_match,{'column':'groups', 'match':'Group\d'},1,1],
    ['daqual/ExampleCSV-2.csv',daqual.score_column_count,{"expected_n":8},1,1],
    ['daqual/ExampleCSV.csv',daqual.score_row_count,{"comparison":'daqual/ExampleCSV-2.csv', 'expected_delta':1},1,1],
]

lukySmile / daqual