tom-metherell / Mice.jl

A package for missing data handling via multiple imputation by chained equations in Julia. It is heavily based on the R package {mice} by Stef van Buuren, Karin Groothuis-Oudshoorn and collaborators.

Home Page: https://tom-metherell.github.io/Mice.jl/


What is Mice doing when there is no missing data?

nilshg opened this issue

I just tried to recreate the benchmark results and downloaded the cirrhosis data. I didn't clock that it was an R data set where missing data is NA so ended up having a data set without any missing data. Still:

julia> @time mice(df)
 46.910824 seconds (11.11 k allocations: 2.727 MiB, 99.95% gc time)
Mids(418×20 DataFrame
 Row │ ID     N_Days  Status   Drug             Age    Sex      Ascites  Hepatomegaly  Spiders  Edema    Bilirubin  Cholesterol  Albumin  Copper   Alk_Phos  SGOT     Tryglicerides  Platelets  Prothrombin  Stage
     │ Int64  Int64   String3  String15

at the end of the output I see:

"Iteration 1, variable N_Days: imputation skipped - no missing data.", "Iteration 1, variable Status: imputation skipped - no missing data.", "Iteration 1, variable Drug: imputation skipped - no missing data.", "Iteration 1, variable Age: imputation skipped - no missing data.", "Iteration 1, variable Sex: imputation skipped - no missing data.", "Iteration 1, variable Ascites: imputation skipped - no missing data.", "Iteration 1, variable Hepatomegaly: imputation skipped - no missing data.", "Iteration 1, variable Spiders: imputation skipped - no missing data.", "Iteration 1, variable Edema: imputation skipped - no missing data."  …  "Iteration 10, variable Bilirubin: imputation skipped - no missing data.", "Iteration 10, variable Cholesterol: imputation skipped - no missing data.", "Iteration 10, variable Albumin: imputation skipped - no missing data.", "Iteration 10, variable Copper: imputation skipped - no missing data.", "Iteration 10, variable Alk_Phos: imputation skipped - no missing data.", "Iteration 10, variable SGOT: imputation skipped - no missing data.", "Iteration 10, variable Tryglicerides: imputation skipped - no missing data.", "Iteration 10, variable Platelets: imputation skipped - no missing data.", "Iteration 10, variable Prothrombin: imputation skipped - no missing data.", "Iteration 10, variable Stage: imputation skipped - no missing data."])

So it looks like it realises there isn't anything missing, but why does it run for 47 seconds and allocate 11k times?

(As an aside, 99.95% gc time suggests that there is significant performance left on the table. Have you profiled the code to see where the time is being spent and what is allocating so much?)

(PS: I now see that when Mice is actually imputing it's much worse: julia> @time mice(df) 66.330727 seconds (23.23 M allocations: 2.851 GiB, 67.27% gc time, 31.15% compilation time: <1% of which was recompilation), so profiling those allocations looks more important than I thought.)

@nilshg re: what happens when there are no missing data -

  • Currently the mice() function doesn't check whether it needs to do anything before initialising the matrices used to store imputations and their means/variances per imputation, so that's probably what the allocations are for. I could add a step where it checks for missing data before initialising these matrices, which would reduce the number of unnecessary allocations.
  • By default, the function calls GC.gc() after every single iteration, and there's an argument gcSchedule which allows this to be controlled. I found in testing that this increased performance for large jobs, but in cases like this it will of course make the performance much worse (50 GC.gc() calls for no work, which is presumably what results in the 99.95% gc time). I assume (or hope?) that if you were to run @time mice(df, gcSchedule = 0.0) the performance would be much better.

Haven't run the profiles yet so that's just my hunch. Can look into it more another time :D
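
The "check before initialising" idea from the first bullet could look roughly like the sketch below. This is not Mice.jl's actual code: `init_imputation_store` is a hypothetical helper, a NamedTuple of plain vectors stands in for the DataFrame, and `Float64` storage is assumed purely for illustration. Storage is allocated only for columns that actually contain missing values, so an empty result tells the caller it can return straight away.

```julia
# Hypothetical helper (an assumption, not part of Mice.jl): allocate one storage
# matrix per column that needs imputing. Rows = missing entries, columns = the
# m imputations. Columns without missing values get no storage at all.
function init_imputation_store(cols::NamedTuple, m::Int)
    store = Dict{Symbol, Matrix{Float64}}()
    for (name, col) in pairs(cols)
        n_missing = count(ismissing, col)
        n_missing == 0 && continue          # nothing to impute: skip the allocation
        store[name] = Matrix{Float64}(undef, n_missing, m)
    end
    return store
end

# Stand-ins for a complete and an incomplete data set.
complete = (x = [1.0, 2.0, 3.0], y = [4.0, 5.0, 6.0])
partial  = (x = [1.0, missing, 3.0], y = [4.0, 5.0, 6.0])

isempty(init_imputation_store(complete, 5))   # true: the caller could exit early
size(init_imputation_store(partial, 5)[:x])   # (1, 5): one missing entry, five imputations
```

With a check like this at the top, the iteration loop (and any garbage collection scheduled inside it) would never start when there is nothing to impute.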

It seems to me the first line of mice() should be something like

any(isa.(missing, eltype.(eachcol(df)))) || return df
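
One thing to keep in mind with a check like this: a column whose element type admits `missing` does not necessarily contain any missing values, so a type-level test and a value-level test can disagree. A minimal sketch of the difference, using a NamedTuple of plain vectors as a stand-in for DataFrame columns (the names `cols`, `allows_missing` and `contains_missing` are made up for illustration):

```julia
# Hypothetical stand-in for a table: a NamedTuple of column vectors.
cols = (
    a = [1, 2, 3],                                # eltype Int64: cannot hold missing
    b = Union{Missing, Float64}[1.0, 2.0, 3.0],   # may hold missing, but has none
    c = [1.0, missing, 3.0],                      # actually contains a missing value
)

# Type-level check: does the column's element type admit `missing`?
allows_missing(col) = Missing <: eltype(col)

# Value-level check: does the column actually contain a missing value?
contains_missing(col) = any(ismissing, col)

type_flags  = map(allows_missing, values(cols))     # (false, true, true)
value_flags = map(contains_missing, values(cols))   # (false, false, true)

# Only the value-level check says whether there is anything to impute.
nothing_to_impute = !any(value_flags)               # false: column c needs imputing
```

The type-level test is cheaper because it never touches the data, but column `b` shows where it gives a false positive.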

Also, are you saying that you are already pre-allocating the necessary containers for the imputations? In that case it's even more unexpected that you are seeing millions of allocations.

I think the millions of allocations are not related to the imputation containers themselves but rather to the temporary matrices that are created on every iteration. The current procedure is:

  • Take a copy of the relevant predictors from the initial data object provided;
  • Copy over the current imputed values from the imputations container;
  • Update the imputations container with the new imputed values.

This is mainly because any categorical variables present need to be converted to dummy variables before the linear algebra steps, which can't (easily) be done in place. In the special case where there are no categorical variables, far fewer allocations would be necessary.
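
To illustrate where such temporaries come from, here is a rough sketch of dummy coding in plain Julia. It is not how Mice.jl does it: `dummy_matrix`, `dummy_matrix!`, `lvls` and the toy data are all assumptions made for the example. The first function returns a fresh matrix on every call; the second writes into a caller-supplied buffer, which is one possible way to reuse memory when the set of levels is fixed, though whether that fits the package's internals is not something this sketch settles.

```julia
# Hypothetical in-place helper: one-hot encode `x` into `out`, treating the first
# entry of `lvls` as the reference category (so it gets no column of its own).
function dummy_matrix!(out::AbstractMatrix, x::AbstractVector, lvls::AbstractVector)
    for j in 2:length(lvls), i in eachindex(x)
        out[i, j - 1] = x[i] == lvls[j] ? 1.0 : 0.0
    end
    return out
end

# Allocating variant: builds a new matrix each time it is called.
dummy_matrix(x, lvls) = dummy_matrix!(zeros(length(x), length(lvls) - 1), x, lvls)

# Toy data (made up): one numeric and one categorical predictor.
age  = [50.0, 61.0, 45.0, 38.0]
sex  = ["F", "M", "F", "F"]
lvls = ["F", "M"]

# Assembling the predictor matrix allocates both the dummy block and the result.
X = hcat(age, dummy_matrix(sex, lvls))   # 4×2 matrix: age, then the "M" indicator

# Reusing a buffer across iterations avoids re-allocating the dummy block.
buf = zeros(length(sex), length(lvls) - 1)
dummy_matrix!(buf, sex, lvls)
```

Called once per variable, per iteration, per imputation, the allocating variant alone produces many short-lived matrices, which is consistent with the allocation counts reported above.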

WRT the allocations when doing nothing, I have now reduced the amount of stuff that mice() will do when there is nothing to impute (coming soon). As such:

julia> @time mice(data)
  0.001539 seconds (538 allocations: 173.953 KiB)
Mids(418×20 DataFrame
 Row │ ID     N_Days  Status   Drug             Age    Sex      Ascites  Hepatomegaly  Spiders  Edema    Bilirubin  Cholesterol  Albumin  Copper   Alk_ ⋯
     │ Int64  Int64   String3  String15         Int64  String1  String3  String3       String3  String1  Float64    String7      Float64  String3  Stri ⋯

(mice() will now also throw an error when there is nothing to impute, but I disabled that for testing purposes)