eltonlaw / impyute

Data imputations library to preprocess datasets with missing data

Home Page: http://impyute.readthedocs.io/

Name change request: mice

stefvanbuuren opened this issue

Dear Elton,

Thanks for your effort to implement an algorithm for imputing multivariate data.

I’d like to request a name change of your impyute.imputation.cs.mice procedure. The documentation of this procedure says that it implements Multivariate Imputation by Chained Equations (MICE) from my JSS 2011 paper. However, this documentation is not accurate since your procedure does not implement the MICE algorithm. It differs in important respects from my method:

  • Your procedure provides a single imputation, whereas MICE is a procedure for generating multiple imputations;
  • Your procedure imputes the “best” (predicted) value, while the MICE algorithm always adds noise;
  • Your procedure uses linear regression, whereas the MICE algorithm is open to any type of imputation model;
  • Your procedure uses different convergence criteria.
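To make the first two points concrete, here is a minimal sketch in Python with NumPy. It is not the code of impyute or of the mice package; the function names `fit_line`, `impute_predicted`, and `impute_with_noise` are hypothetical, and the setup (one predictor `x`, one incomplete variable `y`) is an assumption chosen for brevity. It contrasts filling in the predicted value once with producing several completed copies that each add residual noise. It is also a simplification of proper multiple imputation, which additionally accounts for uncertainty in the regression parameters.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares intercept and slope of y on x, plus the residual std."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma = np.sqrt(resid @ resid / (len(y) - 2))
    return beta, sigma

def impute_predicted(x, y):
    """Single imputation: fill missing y with the regression prediction, no noise."""
    miss = np.isnan(y)
    beta, _ = fit_line(x[~miss], y[~miss])
    out = y.copy()
    out[miss] = beta[0] + beta[1] * x[miss]
    return out

def impute_with_noise(x, y, m, rng):
    """Return m completed copies; each adds normal residual noise to the prediction."""
    miss = np.isnan(y)
    beta, sigma = fit_line(x[~miss], y[~miss])
    copies = []
    for _ in range(m):
        out = y.copy()
        noise = rng.normal(0.0, sigma, miss.sum())
        out[miss] = beta[0] + beta[1] * x[miss] + noise
        copies.append(out)
    return copies
```

With `impute_predicted`, every imputed point lies exactly on the fitted line, so the completed data look less variable than the observed data. The copies from `impute_with_noise` differ from one another in the imputed cells, and that between-copy variation is what a multiple-imputation analysis uses.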

These differences have profound methodological implications. Advertising your procedure as “MICE” will create confusion among analysts, who might be led to believe that they are doing MICE when in fact they are not.

Your procedure is an implementation of Buck’s method published in 1960 (described in more detail in Little & Rubin 2002), so I would suggest that you could perhaps rename to “buck”?

With regards,
Stef van Buuren

Well...this is really, really terrible. Apologies for any trouble this has brought you (and anyone else this has affected); I understand that this mistake could have a huge impact in certain cases. It's been more than a year, but I thought I had understood the paper; that was my mistake. I haven't put enough effort into validating the results of the imputations, and that's something I should have prioritized.

This will be remedied as soon as possible, and I will make an extra effort to ensure this never happens again. Sorry.

And thank you for raising the issue; the explanation is enlightening (and sobering).

Hmm, Buck's method seems to differ from the current implementation in two ways:

  1. In Buck's method, the regression coefficients for each column are calculated on the complete cases. In the current implementation, null values in the other columns are mean-imputed first.
  2. Buck's method is just one pass, whereas what's currently implemented is iterative.
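A rough sketch of the two variants, to show where they diverge. This is neither impyute's actual code nor a faithful reproduction of Buck's paper; the function names are hypothetical. Two simplifications are assumptions of mine: the single-pass version uses column means for any other missing predictors when predicting a row, and the iterative version runs a fixed number of sweeps instead of checking a convergence criterion.

```python
import numpy as np

def _predict(train_X, train_y, new_X):
    """Ordinary least squares with an intercept; returns predictions for new_X."""
    A = np.column_stack([np.ones(len(train_X)), train_X])
    beta, *_ = np.linalg.lstsq(A, train_y, rcond=None)
    return np.column_stack([np.ones(len(new_X)), new_X]) @ beta

def single_pass_complete_case(data):
    """One pass: coefficients are estimated only from fully observed rows."""
    mask = np.isnan(data)
    complete = ~mask.any(axis=1)
    # Mean-filled copy, used only as predictor values when predicting (assumption).
    filled = np.where(mask, np.nanmean(data, axis=0), data)
    out = data.copy()
    for j in range(data.shape[1]):
        rows = mask[:, j]
        if not rows.any():
            continue
        others = np.delete(np.arange(data.shape[1]), j)
        out[rows, j] = _predict(
            data[complete][:, others], data[complete, j], filled[rows][:, others]
        )
    return out

def iterative_mean_init(data, n_iter=10):
    """Start from a mean fill, then repeatedly re-fit each column on the filled data."""
    mask = np.isnan(data)
    out = np.where(mask, np.nanmean(data, axis=0), data)
    for _ in range(n_iter):  # fixed sweep count stands in for a convergence check
        for j in range(data.shape[1]):
            rows = mask[:, j]
            if not rows.any():
                continue
            others = np.delete(np.arange(data.shape[1]), j)
            out[rows, j] = _predict(
                out[~rows][:, others], out[~rows, j], out[rows][:, others]
            )
    return out
```

The first function never trains on imputed values, so its fit uses fewer rows. The second trains on every row in which the target column is observed, at the cost of fitting partly to values that were themselves filled in.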

I could just rewrite it to completely follow the method in the paper...but I don't know if that would be most effective. A concern I have: in 1), a faked complete case is generated on the entire data because, anecdotally, I've sometimes worked with sparse datasets, and I feel that requiring complete cases would pare down the input set too much. Perhaps there is a way to optimize the order in which columns get computed such that the total number (or some other measure) of dropped rows is minimized. Not too sure on this; I need to do more research.
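For a sense of scale on the complete-case concern, a small sketch (hypothetical helper names, not part of impyute): if each cell were missing independently at rate r, the expected share of fully observed rows across p columns would be (1 - r)^p. That independence is an assumption made for illustration only; real missingness patterns are often correlated.

```python
import numpy as np

def complete_case_fraction(data):
    """Share of rows with no missing value in any column."""
    return float((~np.isnan(data).any(axis=1)).mean())

def expected_fraction(miss_rate, n_cols):
    """Expected share of complete rows if every cell is missing independently."""
    return (1.0 - miss_rate) ** n_cols
```

Under that assumption, 20 columns with 10% missing each leave about 12% of rows complete, since `expected_fraction(0.1, 20)` is 0.9 to the 20th power.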

Anyways, for the short term, I'm going to temporarily rename it buck_iterative while a longer-term fix (and a better name) is prepared.

Elton, wonderful.

I agree that buck_iterative is indeed the proper name for the procedure.

Stef.