Oracen-zz / MIDAS

Multiple imputation utilising denoising autoencoders for approximate Bayesian inference


Impute unseen/test data

lilasaba opened this issue

Hello,
thank you for the great work.

I'm trying to impute missing values on data different from the training set by initializing a new imputer object like so:

imputer_test = Midas(layer_structure=[128,128,128],vae_layer=False,seed=908)
imputer_test.build_model(X_test,categorical_columns=feature_cols)

Then I call imputer_test.generate_samples() to impute the missing data in the test set. However, when the test set consists of fewer than 100 samples, there are still NaNs in .output_list.
Is there a theoretical minimum input df size (sorry, I'm not familiar with how autoencoders work)?

Thanks!

No, typically you should see no NaNs in the output. I've run MIDAS with as few as 50 samples and it worked fine, so perhaps there was some instability in training? I'd need to inspect the code a little more closely to assess what's going on. Generally, when I encounter NaNs it's because I've improperly preprocessed categorical features that are then fed into softmax functions. Feel free to upload a code sample and I'll check it out when I get some time.

As a general rule, though, neural networks like big datasets. On smaller datasets, it's likely that alternative algorithms such as MICE, Hmisc or Amelia II (all R packages) will outperform MIDAS.
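If you need to stay in Python, scikit-learn's experimental IterativeImputer gives a rough MICE-style baseline; a minimal sketch (not a MIDAS feature, and it treats the binary columns as numeric):

# Rough MICE-style chained-equations baseline in Python (scikit-learn).
# IterativeImputer is experimental, so the enabling import is required.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

mice = IterativeImputer(max_iter=10, random_state=908)
mice.fit(X_train)  # learn column relationships on the training set
X_test_imputed = pd.DataFrame(mice.transform(X_test), columns=X_test.columns)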

Thanks for the tips, I'll definitely try the methods you mention, but for now I need to stick with Python.

The dataset I'm using contains about 300k rows with 12 features, all of them binary (maybe that's the problem?); I'm not sure whether that qualifies as big.
I've uploaded a 5k sample of it in case you have time to reproduce what I'm doing.

So here's what's happening:

import pandas as pd
from midas import Midas  # assuming midas.py from this repo is on the path

## Load data.
X_train = pd.read_csv('train.csv',header=None)
X_test = pd.read_csv('test.csv',header=None)

## Init Midas (features are independent).
feature_cols = X_train.columns
imputer = Midas(layer_structure=[128,128,128],vae_layer=False,seed=908)
imputer.build_model(X_train,categorical_columns=feature_cols)

## Overimpute; getting 0.18 aggregated error.
imputer.overimpute(training_epochs=5,report_ival=1,report_samples=5,plot_all=False)
## Train; loss: 3.73.
imputer.train_model(training_epochs=5,verbosity_ival=1)

## Now init Midas on test data (maybe not the proper way?).
imputer_test = Midas(layer_structure=[128,128,128],vae_layer=False,seed=908)
imputer_test.build_model(X_test,categorical_columns=feature_cols)

## Init Midas with 50 rows from the test data.
imputer_50 = Midas(layer_structure=[128,128,128],vae_layer=False,seed=908)
rows = X_test.iloc[5:55,:].copy()  # exactly 50 rows
imputer_50.build_model(rows,categorical_columns=feature_cols)

## Generate samples for the test; getting zero NaNs.
imputer_test.generate_samples()
last_test = imputer_test.output_list[-1]
last_test.isna().sum()

## Generate samples for the 50-sample test set; getting 4 NaNs (for features 10 and 11).
imputer_50.generate_samples()
last_test_50 = imputer_50.output_list[-1]
last_test_50.isna().sum()

Interesting. I won't have time to replicate this weekend, but I'll see what I can see.

The first thing I'd suggest is that building separate imputation models for train and test is ill-advised; it's essentially equivalent to assuming each comes from a separate data-generating mechanism. Either train on the train set and impute the test set from that model, or exclude the target column and build the model on the unified Xs. Setting that aside, let's move on to the NaNs on short data.

One thing you haven't done is train the model on the smaller datasets; the imputation outputs are simply the result of the randomly initialised weights. If you want to swap a new dataset into the trained model, you'll need to manually reorder the columns to match X_train, set imputer.imputation_target = X_test, and generate a missingness matrix with imputer.na_matrix = X_test.notnull().astype(np.bool). (You can, of course, swap X_test for rows.) MIDAS was never designed to do this, though, so I have no idea what might happen. In theory at least it should be fine: generate_samples() checks the .imputation_target and .na_matrix attributes on call, so as long as the number of columns is identical I assume everything will work. If I can ask you to post the results here, that will let me know whether I have to rewrite anything. I think, however, I will add a manual data method to simplify unusual cases like yours.
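In code, the swap would look something like this (an untested sketch; it assumes X_test can be reordered to exactly X_train's columns):

import numpy as np

# Untested sketch: reuse the trained graph by overwriting the attributes
# that generate_samples() reads at call time.
X_test = X_test[X_train.columns]  # enforce identical column order
imputer.imputation_target = X_test
imputer.na_matrix = X_test.notnull().astype(np.bool)  # True where observed
imputer.generate_samples()
imputed = imputer.output_list[-1]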

Moving forward, I'd advise using the VAE component with small or simple datasets, as it seems to stabilise the results. By embedding the data in a lower-dimensional distribution, it smooths over some of the weirdness that can happen at the extremes of a neural network's operating range.
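That is, the only change at build time would be the flag:

# Same pipeline as above, but with the VAE layer switched on for small/simple data
imputer = Midas(layer_structure=[128,128,128],vae_layer=True,seed=908)
imputer.build_model(X_train,categorical_columns=feature_cols)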

> Either train on the train set and impute the test set from that,

This is exactly what I'm trying to do, and sorry if that wasn't clear: I would like to train an imputation model (without using the target variable) on the train set, and use that model to impute/transform the test set, which can sometimes be as small as a one-row df.

The reason I built a new graph for the test set is that I wasn't sure how to feed new data to the model trained on the train set, so I figured I would build the graph and then call generate_samples() (without training), which loads the model back in from the /tmp directory.
That worked fine on the test set (though obviously I cannot validate the results), except when the test df contains fewer than around 400 rows; then the NaNs don't get imputed.

Now I've tried what you suggested (with the X_train and X_test columns already ordered):

imputer.imputation_target = X_test
imputer.na_matrix = X_test.notnull().astype(np.bool)
imputer.generate_samples()

but this way not a single NaN gets replaced.

Did you ever figure this out? I'm trying to do something similar: the training set has no missing data, but the test set has data missing at random, and I'm trying to impute the test set after training. I haven't seen how to do this properly; if you figured it out, please let me know how you did it.

Hey Alex,

Yeah, I also need to dig into the code a bit more; I've just been trying to get my use case working first to get an idea of how it will perform. FYI, it seems the model remembers where the NaN values were during training, so aligning the columns alone isn't enough. If I swap in data whose NaN locations match those seen during training, then it works; since my training set had no NaNs, this wasn't hard to match to my testing set. But if the NaNs change location between training the model and testing it with a new imputation target, it seems to output NaNs rather than fill in every value.
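A quick way to check for this (assuming the swapped-in frame has the same shape as the one the model was built on) is to compare the masks directly:

# Hypothetical check: the observed-value mask captured at build time vs. the
# mask of the swapped-in data; any cell where they differ stays NaN.
built_mask = imputer.na_matrix.values  # True where observed at build time
new_mask = X_test.notnull().values     # mask of the data being swapped in
print((built_mask == new_mask).all())  # False => the NaN locations moved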

That said, I think I will just append my testing data to the training dataset for now. I was also curious whether it's better to train with little missingness in the training data (since it has none originally) or to inject a bunch of missingness artificially, as sketched below. I was going to experiment with this on smaller models to see what works best, but was curious whether you had explored it at all.
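Something like this is what I have in mind (a sketch; the 10% spike-in rate is arbitrary):

import numpy as np
import pandas as pd

# Sketch: spike artificial missingness into the fully observed training rows,
# then stack the test rows on so a single model sees both.
rng = np.random.RandomState(908)
spiked = X_train.mask(rng.rand(*X_train.shape) < 0.10)  # ~10% of cells -> NaN
combined = pd.concat([spiked, X_test], ignore_index=True)
imputer = Midas(layer_structure=[128,128,128],vae_layer=False,seed=908)
imputer.build_model(combined,categorical_columns=feature_cols)
imputer.train_model(training_epochs=5)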

Regards

Still haven't tested that out, but I'll let you know if it doesn't work for some reason. I don't have the true test-set values, so I've just been removing portions of the training data (setting them to NaN) instead.

Also, I was curious about the additional_data argument; I don't exactly understand what it's used for. I have extra data for every row of the training set that isn't in the testing set, so I just dropped those columns. I assumed that wasn't what additional_data is for, but I'm curious.

Okay, thanks! That makes sense. Actually, even if I change both the imputation_target and the na_matrix, it doesn't fill in any of the NaN values; I just get the output with none of the NaNs imputed. If I change imputer.na_matrix to:

imputer.na_matrix = test_data[imputer.imputation_target.columns].isnull()

then none of the variables are filled in (everything comes out NaN, so it is at least registering which values are the NaNs). Is something else still blocking the NaN values from being filled in?

The only other variable I've seen is imputer.na_idx; does this possibly also need to be reset, or is it set automatically from imputer.na_matrix?

Yeah, the problem was that I didn't fill the NaN values with 0s.

The earlier switch to .isnull() was a red herring; .notnull() is right, but the NaNs in the target also have to be filled with 0s first. Anyway, for anyone else wanting to do this, you essentially just do the following:

imputer.imputation_target = test_data[imputer.imputation_target.columns].copy().fillna(0)
imputer.na_matrix = test_data[imputer.imputation_target.columns].notnull()
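and then sample as usual:

imputer.generate_samples()
imputed_test = imputer.output_list[-1]  # completed test data; the NaNs are now filled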