maianhpuco / DIMVImputation

The code base for paper "Conditional expectation with regularization for missing data imputation"

Conditional expectation with regularization for missing data imputation (DIMV)

This is an imputation package for missing data, which can be easily installed with pip.

The code repository associated with the paper "Conditional expectation with regularization for missing data imputation." The paper is under journal review; a preprint is available at https://arxiv.org/abs/2302.00911

1. Introduction

Conditional Distribution-based Imputation of Missing Values with Regularization (DIMV) is an algorithm for imputing missing data with low RMSE, scalability, and explainability. It is well suited to critical domains such as medicine and finance, offering reliable analysis, approximated confidence regions, and robustness to assumptions. DIMV relies on a normality assumption as part of its theoretical foundation; normality is commonly assumed in statistical and imputation methods because it simplifies data modeling.

Comparison

In this comparison, we evaluate DIMV's performance on both small datasets with randomly missing data patterns and medium datasets (MNIST and FashionMNIST) with monotone missing data patterns (cutting a piece of the image on the top right).
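The monotone pattern described above (cutting a piece off the top-right of each image) can be sketched in numpy; the function name `cut_top_right` and the `frac` parameter are illustrative, not part of the package:

```python
import numpy as np

def cut_top_right(images, frac=0.5):
    """Create a monotone missing pattern by setting the top-right
    corner of each square image to NaN. `frac` is the fraction of
    each side that is cut away."""
    images = images.astype(float).copy()
    n, h, w = images.shape
    ch, cw = int(h * frac), int(w * frac)
    images[:, :ch, w - cw:] = np.nan  # rows from the top, columns from the right
    return images

# Example: three 28x28 "images" (MNIST-sized)
imgs = np.ones((3, 28, 28))
masked = cut_top_right(imgs, frac=0.5)
```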

Randomly Missing Pattern

For small datasets with random missing data:

(figure: comparison on small datasets with randomly missing values)

Monotonic missing pattern

For medium datasets (MNIST and FashionMNIST):

(figure: comparison on medium datasets, MNIST and FashionMNIST, with monotone missing data)

Here's an illustration of DIMV's imputation for MNIST and FashionMNIST:

(figures: example imputations by DIMV on MNIST and FashionMNIST)

DIMV has shown promising computational efficiency and robustness across small and medium datasets, accommodating a variety of missing data patterns. However, like many imputation methods, its computational cost can grow for large or high-dimensional datasets; popular methods such as k-Nearest Neighbors Imputation (KNNI) face similar performance issues in these scenarios.

2. Repository Contents

The code is structured as follows:

.
├── README.md
├── example.ipynb
├── requirements.txt
└── src
    ├── DIMVImputation.py
    ├── __init__.py
    ├── conditional_expectation.py
    ├── dpers.py
    └── utils.py 

In the /src folder:

  • DIMVImputation.py implements the DIMV imputation algorithm.
  • dpers.py implements the DPER algorithm for computing the covariance matrix used by DIMV (its input is a normalized data matrix).
  • conditional_expectation.py computes the regularized conditional expectation for a sliced position in the dataset, given the covariance matrix.

example.ipynb is a Jupyter notebook with examples demonstrating how to use the package.

3. Installation

Option 1: Install with pip

Install the package with:

pip install git+https://github.com/maianhpuco/DIMVImputation.git 

Option 2: Install from source

  • Step 1: Clone the repository

git clone <repository-url> 

Then create a virtual environment and activate it.

  • Step 2: Install the dependencies from requirements.txt:

pip install -r requirements.txt 
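Step 1 mentions creating and activating a virtual environment without showing the commands; a typical sketch (the `.venv` directory name is just a convention):

```shell
# From the cloned repository root: create a virtual environment
python3 -m venv .venv

# Activate it (POSIX shells; on Windows use .venv\Scripts\activate)
. .venv/bin/activate
```

With the environment active, the `pip install -r requirements.txt` from Step 2 installs the dependencies into `.venv` rather than system-wide.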

4. Usage

For example, suppose we have a complete numpy array named data and a copy named missing_data in which some entries are NaN.

# Create a train/test split
test_size = 0.2
split_index = int(len(missing_data) * (1 - test_size))

# Ground-truth (complete) data, kept for evaluation
X_train_ori, X_test_ori = data[:split_index, :], data[split_index:, :]

# Data with missing entries
X_train_miss = missing_data[:split_index, :]
X_test_miss = missing_data[split_index:, :]  
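The split above assumes the arrays data (complete) and missing_data (the same array with NaNs) already exist. A minimal, hypothetical way to construct them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Complete data: 100 samples, 5 features, drawn from a normal
# distribution (DIMV's theory assumes approximately Gaussian data)
data = rng.normal(size=(100, 5))

# Randomly hide 20% of the entries (missing completely at random)
missing_data = data.copy()
mask = rng.random(missing_data.shape) < 0.2
missing_data[mask] = np.nan
```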

If you installed with pip:

from DIMVImputation import DIMVImputation

# Create an instance of the DIMVImputation class
imputer = DIMVImputation()

# Fit the imputer on the training set to compute the covariance matrix 
imputer.fit(X_train_miss, initializing=False)

# Apply imputation to the missing data that we want to impute 
X_test_imputed = imputer.transform(X_test_miss)  

If you installed from source (option 2, cloning the repo):

The .fit() method computes the covariance matrix from the training set. Fit the model on the train set:

from DIMVImputation.DIMVImputation import DIMVImputation

# Create an instance of the DIMVImputation class
imputer = DIMVImputation()

# Fit the imputer on the training set to compute the covariance matrix 
imputer.fit(X_train_miss, initializing=False)

# Apply imputation to the missing data that we want to impute 
X_test_imputed = imputer.transform(X_test_miss)  
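To sanity-check the result, a common metric is the RMSE restricted to the entries that were actually missing. A small helper (the name `rmse_on_missing` is ours, not part of the package):

```python
import numpy as np

def rmse_on_missing(X_true, X_miss, X_imputed):
    """RMSE computed only over the entries that were missing in X_miss."""
    mask = np.isnan(X_miss)
    diff = X_imputed[mask] - X_true[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy example: one missing entry, imputed as 2.5 where the truth is 2.0
X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
X_miss = np.array([[1.0, np.nan], [3.0, 4.0]])
X_imp = np.array([[1.0, 2.5], [3.0, 4.0]])
print(rmse_on_missing(X_true, X_miss, X_imp))  # → 0.5
```

In the workflow above you would call it as rmse_on_missing(X_test_ori, X_test_miss, X_test_imputed).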

Cross-validation options

By default, DIMVImputation uses cross-validation to determine the optimal value for the regularization parameter (alpha). The default regularization parameter values include alphas of 0.0, 0.01, 0.1, 1.0, 10.0, and 100.0. Moreover, the default percentage of data utilized for training in cross-validation is set to 100%.

  • To specify a custom range of alpha values, call .cross_validate() to run a grid search for the best alpha; the chosen value is then used when transforming the missing data (X_test_miss). For instance:
# Define your alpha grid and specify the data percentage for cross-validation
imputer.cross_validate(alphas=[0.0, 0.01, 0.1, 1.0]) 
X_test_imp = imputer.transform(X_test_miss, cross_validation=False)
  • If you aim to modify the percentage of training data utilized in cross-validation (note: this doesn't affect the .fit() method's training set), you can adjust it as follows:
# Define your alpha grid and set the data percentage for cross-validation
imputer.cross_validate(train_percent=80, alphas=[0.0, 0.01, 0.1, 1.0]) 
X_test_imp = imputer.transform(X_test_miss, cross_validation=False)
  • To enable feature selection, which drops weakly correlated features based on a threshold, pass the following settings. The same selection criterion is applied in both cross-validation and the .fit() method:
imputer.cross_validate(
    train_percent=80,
    alphas=[0.0, 0.01, 0.1, 1.0],
    features_corr_threshold=0.3,
    mlargest_features=5 
) 
X_test_imp = imputer.transform(X_test_miss, cross_validation=False) 
