Conformal Data Cleaning

This repository contains source code for the experiments conducted in the AISTATS 2024 paper From Data Imputation to Data Cleaning - Automated Cleaning of Tabular Data Improves Downstream Predictive Performance.

Run Experiments

First, use load_corrupt_and_test_datasets.ipynb to download and corrupt the datasets and to set up the expected structure of the data directory.
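The corruption step injects errors into a configurable fraction of the dataset's cells (the --error_fractions values used in the commands below). As a toy sketch of what an error fraction means — an assumed illustration, not the repository's actual corruption code:

```python
import numpy as np

def corrupt_fraction(X, fraction, rng=None):
    """Illustrative sketch: overwrite a random `fraction` of cells with NaN.

    This is NOT the repository's corruption logic, just a toy example of
    what an error fraction of e.g. 0.3 means: 30% of all cells are corrupted.
    """
    rng = np.random.default_rng(rng)
    X = X.astype(float).copy()
    n_corrupt = int(round(fraction * X.size))
    flat_idx = rng.choice(X.size, size=n_corrupt, replace=False)
    X.ravel()[flat_idx] = np.nan
    return X

X = np.arange(100.0).reshape(10, 10)
Xc = corrupt_fraction(X, 0.3, rng=0)
print(np.isnan(Xc).mean())  # → 0.3
```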

run_experiment.py implements a simple CLI script (run-experiment) that makes it easy to run experiments.

Conformal Data Cleaning:

run-experiment \
	--task_id "42493" \
	--error_fractions "0.01" "0.05" "0.1" "0.3" "0.5" \
	--num_repetitions "3" \
	--results_path "/conformal-data-cleaning/results/final-experiments" \
	--models_path "/conformal-data-cleaning/models/final-experiments" \
	--how_many_hpo_trials "50" \
	experiment \
	--confidence_level "0.999"

ML Baseline:

run-experiment \
	--task_id "42493" \
	--error_fractions "0.01" "0.05" "0.1" "0.3" "0.5" \
	--num_repetitions "3" \
	--results_path "/conformal-data-cleaning/results/final-experiments" \
	--models_path "/conformal-data-cleaning/models/final-experiments" \
	--how_many_hpo_trials "50" \
	baseline \
	--method "AutoGluon" \
	--method_hyperparameter "0.999"

PyOD Baseline (not included in the paper):

run-experiment \
	--task_id "42493" \
	--error_fractions "0.01" "0.05" "0.1" "0.3" "0.5" \
	--num_repetitions "3" \
	--results_path "/conformal-data-cleaning/results/final-experiments" \
	--models_path "/conformal-data-cleaning/models/final-experiments" \
	--how_many_hpo_trials "50" \
	baseline \
	--method "PyodECOD" \
	--method_hyperparameter "0.3"

For Garf, use main.py instead:

python main.py \
	--task_id "42493" \
	--error_fractions "0.01" "0.05" "0.1" "0.3" "0.5" \
	--num_repetitions "3" \
	--results_path "/conformal-data-cleaning/results/final-experiments" \
	--models_path "/conformal-data-cleaning/models/final-experiments"

Run our Experimental Setup

We ran our experiments on Kubernetes using Helm. Please check out the Helm charts and adjust the image and imagePullSecrets settings in the values.yaml files to match your setup. Note that some read-write-many volumes are necessary to store the experiment results. See the infrastructure/k8s directory (and don't forget to set up the data directory as described above).

Running make docker builds and pushes the necessary Docker images, and make helm-install uses deploy_experiments.py to start our experimental setup.

Evaluation

notebooks/evaluation contains the notebooks we use for evaluating the results; 5_plotting.ipynb produces the plots shown in the paper.

License: MIT