kwanUm / DataQuality

Tutorials and examples of data quality in big data systems

DataQuality

Tutorials and examples of data quality in big data systems.

Data Quality metrics:

  • completeness
      • commission
      • omission
  • thematic accuracy
      • thematic classification correctness
      • non-quantitative attribute correctness
      • quantitative attribute accuracy
  • logical consistency
      • conceptual consistency
      • domain consistency
      • format consistency
      • topological consistency
  • temporal quality
      • accuracy of a time measurement
      • temporal consistency
      • temporal validity
  • positional accuracy
      • absolute external positional accuracy
      • relative internal positional accuracy
      • gridded data positional accuracy
  • usability
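
As a rough illustration of how a few of these metrics can be computed, here is a minimal sketch in plain Python. It assumes completeness is measured against a reference set of expected keys, and that domain consistency is the share of values inside an allowed set. The names `CompletenessReport`, `completeness`, and `domain_consistency` are hypothetical and do not come from any tool listed in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletenessReport:
    omission_rate: float    # share of expected items missing from the dataset
    commission_rate: float  # share of dataset items that should not be there


def completeness(expected_ids, actual_ids):
    """Compare a dataset's keys against a reference set of expected keys."""
    expected, actual = set(expected_ids), set(actual_ids)
    missing = expected - actual   # omission: expected but absent
    excess = actual - expected    # commission: present but not expected
    return CompletenessReport(
        omission_rate=len(missing) / len(expected) if expected else 0.0,
        commission_rate=len(excess) / len(actual) if actual else 0.0,
    )


def domain_consistency(values, allowed):
    """Fraction of values that fall inside the allowed domain."""
    values = list(values)
    if not values:
        return 1.0  # assumption: an empty column is treated as consistent
    return sum(v in allowed for v in values) / len(values)


report = completeness(expected_ids=[1, 2, 3, 4], actual_ids=[2, 3, 4, 5, 6])
print(report)  # omission_rate=0.25, commission_rate=0.4
print(domain_consistency(["DE", "FR", "XX", "US"], allowed={"DE", "FR", "US"}))  # 0.75
```

In this example one of four expected keys is missing (omission) and two of the five keys present are unexpected (commission). On large data sets the same set arithmetic would normally be expressed as joins in a distributed engine rather than as in-memory sets.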

Your contributions are always welcome!

  • Griffin - Data quality solution for distributed data systems at any scale, in both streaming and batch contexts. It measures accuracy, completeness, validity, and timeliness, and supports anomaly detection and data profiling. (Recommended)
  • drunken-data-quality - Provides data quality reports using Spark, Elasticsearch, Logstash, and Kibana (ELK). Demo: https://github.com/FRosner/ddq-demo-elk
  • DataQuality for BigData - a framework to build parallel and distributed quality checks on big data environments. It can be used to calculate metrics and perform checks to assure quality on structured or unstructured data. It relies entirely on Spark.
  • TopNotch - A system for quality controlling large-scale data sets. It addresses three problems: how to define and measure data quality, how to efficiently ensure data quality across many data sets, and how to institutionalize existing knowledge of data sets.
  • Phasor Data Quality Tracker - The PDQ Tracker, administered by the Grid Protection Alliance (GPA), is a high-performance, real-time data processing engine designed to raise alarms, track states, store statistics, and generate reports on both the availability and accuracy of streaming synchrophasor data. [doc] (http://www.gridprotectionalliance.org/docs/products/PDQTracker/highlevelrequirements.pdf)
  • DataCleaner - The premier open source Data Quality solution. See the DataCleaner documentation.
  • data-quality - Talend Open Studio for Data Quality can be downloaded from the Talend website.
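
Several of the tools above describe a two-step pattern: first calculate metrics, then run checks against them. The sketch below shows that pattern in plain Python under stated assumptions. It does not use or imitate the API of any listed tool, and `null_ratio`, `temporal_validity`, `run_checks`, the column names, and the thresholds are all hypothetical.

```python
from datetime import date


def null_ratio(rows, column):
    """Metric: share of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    return sum(row.get(column) is None for row in rows) / len(rows)


def temporal_validity(rows, column, earliest, latest):
    """Metric: share of non-null dates that fall within [earliest, latest]."""
    dates = [row[column] for row in rows if row.get(column) is not None]
    if not dates:
        return 1.0
    return sum(earliest <= d <= latest for d in dates) / len(dates)


def run_checks(metrics, thresholds):
    """Check: compare each metric to a ("max" | "min", limit) pair."""
    results = {}
    for name, (kind, limit) in thresholds.items():
        value = metrics[name]
        results[name] = value <= limit if kind == "max" else value >= limit
    return results


rows = [
    {"id": 1, "signup": date(2020, 5, 1)},
    {"id": 2, "signup": None},
    {"id": 3, "signup": date(2031, 1, 1)},  # outside the accepted window below
    {"id": 4, "signup": date(2021, 7, 9)},
]
metrics = {
    "null_ratio": null_ratio(rows, "signup"),
    "temporal_validity": temporal_validity(
        rows, "signup", date(2000, 1, 1), date(2025, 1, 1)
    ),
}
results = run_checks(
    metrics,
    {"null_ratio": ("max", 0.3), "temporal_validity": ("min", 0.9)},
)
print(metrics)
print(results)  # null_ratio passes, temporal_validity fails
```

Keeping metrics separate from checks lets the same metric values feed reports, dashboards, and alerts with different thresholds.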
