Daniel Del Guido (University of Stuttgart), Tim Schubert (University of Stuttgart), and Mohamed Abdelaal (Software AG), RTClean: Context-aware Tabular Data Cleaning using Real-time OFDs. Published at the 2023 IEEE Int. Conf. on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops)
This is a prototype implementation of a concept for extracting dependencies from an ontology to improve existing error detection methods. These dependencies include new types that build on the existing Denial Dependency and Matching Dependency:
- Device-Link Dependency
- Temporal Dependency
- Locality Dependency
- Monitoring Dependency
- Capability Dependency
Contains two datasets, Hospital and IoT:
- Hospital is a commonly used benchmark dataset from the US Health Service and already contains typos.
- IoT is self-collected via a Smart Home Application with four temperature sensors. To inject errors, use inject.py.
- Definition of dependencies as Python classes
- Definition of an ontology loader to parse from a file or a database
- Contains SPARQL queries to extract dependencies from ontologies
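To illustrate what a dependency defined as a Python class might look like, here is a minimal sketch. It is not taken from the repository: the class name `Dependency`, its fields, the `violations` method, and the sample column names are all hypothetical, and the check shown is a plain functional-dependency-style rule (rows that agree on one attribute must agree on another).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    """Hypothetical dependency: rows agreeing on `lhs` must agree on `rhs`."""

    name: str
    lhs: str
    rhs: str

    def violations(self, rows):
        """Return the sorted lhs values that map to more than one rhs value."""
        seen = defaultdict(set)
        for row in rows:
            seen[row[self.lhs]].add(row[self.rhs])
        return sorted(key for key, values in seen.items() if len(values) > 1)


# Made-up sample rows; the column names are not taken from the IoT dataset.
rows = [
    {"sensor": "s1", "room": "kitchen"},
    {"sensor": "s1", "room": "kitchen"},
    {"sensor": "s2", "room": "bedroom"},
    {"sensor": "s2", "room": "hallway"},
]
locality = Dependency(name="locality", lhs="sensor", rhs="room")
print(locality.violations(rows))
```

Keeping each dependency as a small immutable object makes it straightforward to collect the ones extracted from an ontology in a list and apply them to a table one after another.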
Only used for plotting results in a matplotlib figure
Results of the evaluation carried out as part of the master's thesis
Only used for tests that parse ontologies from a file or a database
Contains runnables to test data validation:
- Execution of HoloClean with Hospital dataset
- Execution of HoloClean with IoT dataset
- Execution of Raha with IoT dataset
- Injection of outliers in IoT dataset
- Execution of dBoost with outlier IoT dataset
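As a rough idea of what injecting outliers into a sensor column involves, the sketch below scales a random fraction of readings by a constant factor. It is an assumption-laden illustration, not the repository's injection script: the function name `inject_outliers`, its parameters, and the sample temperatures are invented for this example.

```python
import random


def inject_outliers(values, fraction=0.1, scale=10.0, seed=0):
    """Return a copy of `values` with a fraction of entries scaled, plus the changed indices."""
    rng = random.Random(seed)  # fixed seed keeps the injection reproducible
    count = int(len(values) * fraction)
    indices = sorted(rng.sample(range(len(values)), count))
    noisy = list(values)
    for i in indices:
        noisy[i] = values[i] * scale
    return noisy, indices


# Made-up temperature readings in degrees Celsius.
temperatures = [21.0, 21.5, 22.0, 21.8, 22.1, 21.9, 22.3, 21.7, 22.0, 21.6]
noisy, changed = inject_outliers(temperatures, fraction=0.2)
print(changed, [noisy[i] for i in changed])
```

Returning the changed indices alongside the data gives a ground truth against which an outlier detector such as dBoost can be scored.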
Ontologies are used to extract dependencies in the context of the data and to find relations. These are evaluated for use in the rest of the pipeline.
The concept is implemented to work with HoloClean and Raha. The HoloClean framework is enhanced with the extracted information to improve its error detection capabilities.
Hint: You need to build error-generator. Change every occurrence of "get_values()" to "values", since get_values() is deprecated in pandas but was not updated in that project.
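The replacement can be done by hand or scripted. The following is a minimal sketch of the scripted route, assuming the error-generator sources are plain .py files under a directory you pass in; the helper names `replace_get_values` and `patch_tree` are made up for this example and are not part of error-generator.

```python
import re
from pathlib import Path

# Matches the deprecated accessor call, e.g. `df.get_values()`.
GET_VALUES = re.compile(r"\.get_values\(\)")


def replace_get_values(source):
    """Return `source` with every `.get_values()` call replaced by `.values`."""
    return GET_VALUES.sub(".values", source)


def patch_tree(root):
    """Rewrite all .py files under `root` in place and return the changed paths."""
    changed = []
    for path in Path(root).rglob("*.py"):
        text = path.read_text(encoding="utf-8")
        patched = replace_get_values(text)
        if patched != text:
            path.write_text(patched, encoding="utf-8")
            changed.append(path)
    return changed


print(replace_get_values("x = df['a'].get_values()"))
```

Calling `patch_tree` on your local checkout of error-generator applies the change to every file; review the resulting diff before building.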
Hint: If you are running Python 3.8 or above, you need to change all occurrences of time.clock() to time.time(). This is a known issue of HoloClean.
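If you would rather not edit the HoloClean sources, one possible workaround is to restore the missing name before HoloClean is imported. This is only a sketch of that alternative, not something the project ships; it assumes the affected code calls `time.clock()` through the `time` module.

```python
import time

# time.clock() was removed in Python 3.8; alias it to time.time(),
# the same replacement the hint recommends.
if not hasattr(time, "clock"):
    time.clock = time.time

# Code that still calls time.clock() now works unchanged.
start = time.clock()
elapsed = time.clock() - start
print(f"elapsed: {elapsed:.6f}s")
```

The alias has to be set up in the entry script before any module that uses `time.clock()` is executed.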
- Python (Version 3.8.x)
- rdflib
- pyfuseki
- pandas