Oracen-zz / MIDAS

Multiple imputation utilising denoising autoencoders for approximate Bayesian inference


Impute unseen/test data

lilasaba opened this issue

Hello,
thank you for the great work.

I'm trying to impute missing values on data different from the training set by initializing a new imputer object like so:

imputer_test = Midas(layer_structure=[128,128,128],vae_layer=False,seed=908)
imputer_test.build_model(X_test,categorical_columns=feature_cols)

Then I call imputer_test.generate_samples() to impute the missing data in the test set. However, when the test set consists of fewer than 100 samples, there are still NaNs in .output_list.
Is there a theoretical minimum input df size (sorry, I'm not familiar with how autoencoders work)?

Thanks!

No, typically you should see no NaNs in the output. I've run MIDAS with as few as 50 samples and it worked fine, so perhaps there was some instability in training? I'd need to inspect the code a little more closely to assess what's going on. Generally, when I encounter NaNs it's because I've improperly preprocessed categorical features that are then fed into softmax functions. Feel free to upload a code sample and I'll check it out when I get some time.

As a general rule, though, neural networks like big datasets. On smaller datasets, it's likely that alternative algorithms such as MICE, Hmisc or Amelia II (all R packages) will outperform MIDAS.
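If you need to stay in Python, scikit-learn's experimental IterativeImputer gives a rough MICE-style baseline; a minimal sketch (not a MIDAS feature, and it treats the binary columns as numeric):

# Rough MICE-style chained-equations baseline in Python (scikit-learn).
# IterativeImputer is experimental, so the enabling import is required.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

mice = IterativeImputer(max_iter=10, random_state=908)
mice.fit(X_train)  # learn column relationships on the training set
X_test_imputed = pd.DataFrame(mice.transform(X_test), columns=X_test.columns)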

Thanks for the tips, I'll definitely try the methods you mention, but for now I need to stick with Python.

The dataset I'm using contains about 300k rows with 12 features, all of them binary (maybe that's the problem?); I'm not sure whether that qualifies as big.
I've uploaded a 5k sample of it in case you have time to reproduce what I'm doing.

So here's what's happening:

import pandas as pd
from midas import Midas  # assuming midas.py from this repo is on the path

## Load data.
X_train = pd.read_csv('train.csv',header=None)
X_test = pd.read_csv('test.csv',header=None)

## Init Midas (features are independent).
feature_cols = X_train.columns
imputer = Midas(layer_structure=[128,128,128],vae_layer=False,seed=908)
imputer.build_model(X_train,categorical_columns=feature_cols)

## Overimpute; getting 0.18 aggregated error.
imputer.overimpute(training_epochs=5,report_ival=1,report_samples=5,plot_all=False)
## Train; loss: 3.73.
imputer.train_model(training_epochs=5,verbosity_ival=1)

## Now init Midas on test data (maybe not the proper way?).
imputer_test = Midas(layer_structure=[128,128,128],vae_layer=False,seed=908)
imputer_test.build_model(X_test,categorical_columns=feature_cols)

## Init Midas with 50 rows from the test data.
imputer_50 = Midas(layer_structure=[128,128,128],vae_layer=False,seed=908)
rows = X_test.iloc[5:55,:].copy()  # exactly 50 rows
imputer_50.build_model(rows,categorical_columns=feature_cols)

## Generate samples for the test; getting zero NaNs.
imputer_test.generate_samples()
last_test = imputer_test.output_list[-1]
last_test.isna().sum()

## Generate samples for the 50-sample test set; getting 4 NaNs (for features 10 and 11).
imputer_50.generate_samples()
last_test_50 = imputer_50.output_list[-1]
last_test_50.isna().sum()

Interesting. I won't have time to replicate this weekend, but I'll see what I can see.

The first thing I'd suggest is that building separate imputation models for train and test is ill-advised; it's essentially equivalent to assuming each comes from a separate data-generating mechanism. Either train on the train set and impute the test set from that model, or exclude the target column and build the model on the unified Xs. Setting that aside, let's move on to the NaNs on short data.

One thing you haven't done is train the model on the smaller datasets; the imputation outputs are simply the result of the randomly initialised weights. If you want to swap a new dataset into the trained model, you'll need to manually reorder the columns to match X_train, set imputer.imputation_target = X_test, and generate a missingness matrix with imputer.na_matrix = X_test.notnull().astype(np.bool). (You can, of course, swap X_test for rows.) MIDAS was never designed to do this, though, so I have no idea what might happen. In theory at least it should be fine: generate_samples() checks the .imputation_target and .na_matrix attributes on call, so as long as the number of columns is identical I assume everything will work. If I can ask you to post the results here, that will let me know whether I have to rewrite anything. I think, however, I will add a manual data method to simplify unusual cases like yours.
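In code, the swap would look something like this (an untested sketch; it assumes X_test can be reordered to exactly X_train's columns):

import numpy as np

# Untested sketch: reuse the trained graph by overwriting the attributes
# that generate_samples() reads at call time.
X_test = X_test[X_train.columns]  # enforce identical column order
imputer.imputation_target = X_test
imputer.na_matrix = X_test.notnull().astype(np.bool)  # True where observed
imputer.generate_samples()
imputed = imputer.output_list[-1]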

Moving forward, I'd advise using the VAE component with small or simple datasets, as it seems to stabilise the results. By embedding the data in a lower-dimensional distribution, it smooths over some of the weirdness that can happen at the extremes of a neural network's operating range.
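That is, the only change at build time would be the flag:

# Same pipeline as above, but with the VAE layer switched on for small/simple data
imputer = Midas(layer_structure=[128,128,128],vae_layer=True,seed=908)
imputer.build_model(X_train,categorical_columns=feature_cols)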

> Either train on the train set and impute the test set from that,

This is exactly what I'm trying to do, and sorry if that wasn't clear: I would like to train an imputation model (without using the target variable) on the train set, and use that model to impute/transform the test set, which can sometimes be as small as a one-row df.

The reason I built a new graph for the test set is that I wasn't sure how to feed new data to the model trained on the train set, so I figured I would build the graph and then call generate_samples() (without training), which loads the model back in from the /tmp directory.
That worked fine on the test set (though obviously I cannot validate the results), except when the test df contains fewer than around 400 rows; then the NaNs don't get imputed.

Now I've tried what you suggested (with the X_train and X_test columns already ordered):

imputer.imputation_target = X_test
imputer.na_matrix = X_test.notnull().astype(np.bool)
imputer.generate_samples()

but this way not a single NaN gets replaced.

Did you ever figure this out? I'm trying to do something similar: the training set has no missing data, but the test set has data missing at random, and I'm trying to impute the test set after training. I haven't seen how to do this properly; if you figured it out, please let me know how you did it.

Hey Alex,

Yeah, I also need to dig into the code a bit more; I've just been trying to get my use case working first to get an idea of how it will perform. FYI, it seems the model remembers where the NaN values were during training, so aligning the columns alone isn't enough. If I swap in data whose NaN locations match those seen during training, then it works; since my training set had no NaNs, this wasn't hard to match to my testing set. But if the NaNs change location between training the model and testing it with a new imputation target, it seems to output NaNs rather than fill in every value.
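A quick way to check for this (assuming the swapped-in frame has the same shape as the one the model was built on) is to compare the masks directly:

# Hypothetical check: the observed-value mask captured at build time vs. the
# mask of the swapped-in data; any cell where they differ stays NaN.
built_mask = imputer.na_matrix.values  # True where observed at build time
new_mask = X_test.notnull().values     # mask of the data being swapped in
print((built_mask == new_mask).all())  # False => the NaN locations moved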

That said, I think I will just append my testing data to the training dataset for now. I was also curious whether it's better to train with little missingness in the training data (since it has none originally) or to inject a bunch of missingness artificially, as sketched below. I was going to experiment with this on smaller models to see what works best, but was curious whether you had explored it at all.
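Something like this is what I have in mind (a sketch; the 10% spike-in rate is arbitrary):

import numpy as np
import pandas as pd

# Sketch: spike artificial missingness into the fully observed training rows,
# then stack the test rows on so a single model sees both.
rng = np.random.RandomState(908)
spiked = X_train.mask(rng.rand(*X_train.shape) < 0.10)  # ~10% of cells -> NaN
combined = pd.concat([spiked, X_test], ignore_index=True)
imputer = Midas(layer_structure=[128,128,128],vae_layer=False,seed=908)
imputer.build_model(combined,categorical_columns=feature_cols)
imputer.train_model(training_epochs=5)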

Regards

Still haven't tested that out, but I'll let you know if it doesn't work for some reason. I don't have the true test-set values, so I've just been removing portions of the training data (setting them to NaN) instead.

Also, I was curious about the additional_data argument; I don't exactly understand what it's used for. I have extra data for every row of the training set that isn't in the testing set, so I just dropped those columns. I assumed that wasn't what additional_data is for, but I'm curious.

Okay, thanks! That makes sense. Actually, even if I change both the imputation_target and the na_matrix, it doesn't fill in any of the NaN values; I just get the output with none of the NaNs imputed. If I change imputer.na_matrix to:

imputer.na_matrix = test_data[imputer.imputation_target.columns].isnull()

then none of the variables are filled in (everything comes out NaN, so it is at least registering which values are the NaNs). Is something else still blocking the NaN values from being filled in?

The only other variable I've seen is imputer.na_idx; does this possibly also need to be reset, or is it set automatically from imputer.na_matrix?

Yeah, the problem was that I didn't fill the NaN values with 0s.

The earlier switch to .isnull() was a red herring; .notnull() is right, but the NaNs in the target also have to be filled with 0s first. Anyway, for anyone else wanting to do this, you essentially just do the following:

imputer.imputation_target = test_data[imputer.imputation_target.columns].copy().fillna(0)
imputer.na_matrix = test_data[imputer.imputation_target.columns].notnull()
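and then sample as usual:

imputer.generate_samples()
imputed_test = imputer.output_list[-1]  # completed test data; the NaNs are now filled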