mitdbg / imputedb

A database with automatic dynamic imputation of missing values.

ImputeDB

ImputeDB is a SQL database that automatically imputes missing data on the fly. Users can issue SQL queries over data with NULL values, and ImputeDB uses a regression model to fill in the missing values while the query executes. Designed for exploratory analysis of survey data, ImputeDB removes the cost of performing imputation manually, so users can get a quick and accurate view of their data.
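To illustrate the general idea of regression-based imputation at query time, here is a minimal Python sketch. It is not ImputeDB's implementation: the column names (`age`, `weight`), the helper functions, and the choice of a single-predictor least-squares fit are all assumptions made for illustration.

```python
from statistics import mean


def fit_simple_regression(xs, ys):
    """Least-squares fit of y = slope * x + intercept (illustrative helper)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate predictor: fall back to predicting the mean of y.
        return 0.0, y_bar
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def impute_column(rows, predictor, target):
    """Return copies of rows with None in `target` replaced by a prediction."""
    # Train only on rows where both values are present.
    complete = [r for r in rows
                if r[predictor] is not None and r[target] is not None]
    slope, intercept = fit_simple_regression(
        [r[predictor] for r in complete],
        [r[target] for r in complete],
    )
    filled = []
    for r in rows:
        r = dict(r)  # copy, so the stored data keeps its NULLs
        if r[target] is None and r[predictor] is not None:
            r[target] = slope * r[predictor] + intercept
        filled.append(r)
    return filled


# Hypothetical table with one missing value (None plays the role of NULL).
rows = [
    {"age": 20, "weight": 60.0},
    {"age": 30, "weight": 70.0},
    {"age": 40, "weight": 80.0},
    {"age": 50, "weight": None},
]

# Impute while answering the query, then aggregate over the filled rows.
imputed = impute_column(rows, "age", "weight")
avg_weight = mean(r["weight"] for r in imputed)
print(avg_weight)
```

The stored rows are left unchanged; the missing value is filled only in the copy used to answer the query, which mirrors the "during query execution" behavior described above.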

Building and Running

To build ImputeDB, run:

cd simpledb; ant

To create a database from the demo collection of CSV files, run:

./imputedb load --db demo.db demo_data/*

This creates three tables:

  • demo: demographics data from the CDC
  • labs: laboratory data from the CDC
  • exams: physical examination data from the CDC

and places their serialized representations in the demo.db folder, along with a catalog describing the table schemas.

Then, to query the database, run:

./imputedb query --db demo.db

This launches the ImputeDB interpreter with --alpha 0.0 by default, which means ImputeDB will optimize for data quality. To change this, call the interpreter with the --alpha <double> option.

For example,

./imputedb query --db demo.db --alpha 1.0

launches an interpreter that optimizes for query execution speed.
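One way to picture the trade-off that --alpha controls is as a weighted choice between candidate query plans. The Python sketch below is an assumption made for illustration, not ImputeDB's planner: the plan names, the cost numbers, and the linear scoring rule are all hypothetical. It only reproduces the two endpoints documented above (0.0 favors quality, 1.0 favors speed).

```python
def pick_plan(plans, alpha):
    """Pick the plan with the lowest weighted cost.

    plans: list of (name, time_cost, quality_penalty) tuples, with both
           costs normalized to [0, 1] (hypothetical values).
    alpha: 0.0 weighs only the quality penalty, 1.0 weighs only time.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0.0 and 1.0")
    # Assumed scoring rule: linear blend of the two costs.
    return min(plans, key=lambda p: alpha * p[1] + (1.0 - alpha) * p[2])


# Hypothetical candidate plans: (name, time_cost, quality_penalty).
plans = [
    ("impute_early", 0.9, 0.1),          # slow, but keeps the most information
    ("impute_late", 0.5, 0.4),           # middle ground
    ("drop_incomplete_rows", 0.1, 0.9),  # fast, but discards data
]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, pick_plan(plans, alpha)[0])
```

Under these made-up numbers, alpha 0.0 selects the slow, high-quality plan, alpha 1.0 selects the fast plan, and an intermediate value selects the middle option.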

Experiments

  1. Build the Docker container for the experiments.
cd simpledb/test/experiments
make build
  2. Run the experiments.

TODO.

Publications

Query Optimization for Dynamic Imputation. José Cambronero*, John K. Feser*, Micah J. Smith*, Samuel Madden. VLDB. (2017) To appear. [pdf]

*Authors contributed equally to this paper.

License

MIT
