Categorical variables and multiple imputation

Question

richardwu opened this issue 5 years ago

For categorical variables, I understand that we one-hot encode the variables and take the argmax as the imputation result.

With multiple iterations, numerical values are averaged and the resulting mean is taken as the model's prediction. What is the recommended way to do this for categorical variables? Should the plurality be taken as the final imputation?

Additionally, would it be valid to simply take a single iteration's imputation result as the model's prediction? Are there any bounds on the bias of the model as a function of the number of iterations?

Thanks!

Ranjit Lall · Answer 1 · Sat Aug 03 2019 16:51:34 GMT+0800 (China Standard Time)

Some advice on these issues can be found in Section 3.2 of: Lall, Ranjit. "How multiple imputation makes a difference." Political Analysis 24, no. 4 (2016): 414-433.

What do you plan to do with the m imputed datasets? If you'll be analyzing them, you should leave them as they are and combine the results of the m separate analyses using the "Rubin combination rules."
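For reference, the combination step for a single scalar quantity (say, one regression coefficient) can be sketched as follows. This is a minimal illustration of the standard Rubin rules, not code from any particular package; the function name `pool_rubin` and its arguments are made up for the example, and it assumes you already have one point estimate and one squared standard error from each of the m analyses.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one scalar estimate across m imputed datasets (hypothetical helper).

    estimates: the m point estimates, one per imputed dataset
    variances: the m squared standard errors of those estimates
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputed datasets")

    pooled = mean(estimates)      # pooled point estimate
    within = mean(variances)      # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divides by m - 1)

    # Total variance adds the between-imputation spread, inflated by (1 + 1/m).
    total = within + (1 + 1 / m) * between
    return pooled, total


# Example: the same coefficient estimated on three imputed datasets.
pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The pooled standard error is the square root of the total variance. Note that the imputed values themselves are never averaged or voted on; only the analysis results are combined.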

In general, you'll need at least several imputed datasets for valid estimation. Lall's suggested rule of thumb is that m should be equal to the average missing-data rate, expressed as a percentage, of all variables in the imputation model.
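A rough sketch of that rule of thumb is below. It makes several assumptions that are not stated in the thread: the rate is read as a percentage (so an average of 37.5% missing suggests m = 38), the result is rounded up, missing cells are marked with `None`, and the helper names are hypothetical.

```python
import math


def average_missing_rate(columns):
    """Mean fraction of missing cells across variables.

    columns: dict mapping variable name -> list of values,
    where None marks a missing value (an assumption of this sketch).
    """
    rates = [
        sum(value is None for value in values) / len(values)
        for values in columns.values()
    ]
    return sum(rates) / len(rates)


def suggested_m(columns):
    """Rule-of-thumb number of imputations: average missing rate in percent."""
    percent = round(100 * average_missing_rate(columns), 6)  # guard float noise
    return max(1, math.ceil(percent))  # rounding up is an assumption


data = {
    "age": [34, None, 51, 29],        # 25% missing
    "region": [None, None, "N", "S"],  # 50% missing
}
m = suggested_m(data)
```

Here the average missing-data rate is 37.5%, so the sketch suggests 38 imputed datasets.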