ppalmes / Impute.jl

Imputation methods for missing data in Julia

Home Page: https://invenia.github.io/Impute.jl/latest/


Impute


Impute.jl provides various methods for handling missing data in Vectors, Matrices and Tables.

Installation

julia> using Pkg; Pkg.add("Impute")

Quickstart

Let's start by loading our dependencies:

julia> using DataFrames, RDatasets, Impute

We'll also want some test data containing missings to work with:

julia> df = dataset("boot", "neuro")
469×6 DataFrames.DataFrame
│ Row │ V1       │ V2       │ V3      │ V4       │ V5       │ V6       │
│     │ Float64⍰ │ Float64⍰ │ Float64 │ Float64⍰ │ Float64⍰ │ Float64⍰ │
├─────┼──────────┼──────────┼─────────┼──────────┼──────────┼──────────┤
│ 1   │ missing  │ -203.7   │ -84.1   │ 18.5     │ missing  │ missing  │
│ 2   │ missing  │ -203.0   │ -97.8   │ 25.8     │ 134.7    │ missing  │
│ 3   │ missing  │ -249.0   │ -92.1   │ 27.8     │ 177.1    │ missing  │
│ 4   │ missing  │ -231.5   │ -97.5   │ 27.0     │ 150.3    │ missing  │
│ 5   │ missing  │ missing  │ -130.1  │ 25.8     │ 160.0    │ missing  │
│ 6   │ missing  │ -223.1   │ -70.7   │ 62.1     │ 197.5    │ missing  │
│ 7   │ missing  │ -164.8   │ -12.2   │ 76.8     │ 202.8    │ missing  │
⋮
│ 462 │ missing  │ -207.3   │ -88.3   │ 9.6      │ 104.1    │ 218.0    │
│ 463 │ -242.6   │ -142.0   │ -21.8   │ 69.8     │ 148.7    │ missing  │
│ 464 │ -235.9   │ -128.8   │ -33.1   │ 68.8     │ 177.1    │ missing  │
│ 465 │ missing  │ -140.8   │ -38.7   │ 58.1     │ 186.3    │ missing  │
│ 466 │ missing  │ -149.5   │ -40.3   │ 62.8     │ 139.7    │ 242.5    │
│ 467 │ -247.6   │ -157.8   │ -53.3   │ 28.3     │ 122.9    │ 227.6    │
│ 468 │ missing  │ -154.9   │ -50.8   │ 28.1     │ 119.9    │ 201.1    │
│ 469 │ missing  │ -180.7   │ -70.9   │ 33.7     │ 114.8    │ 222.5    │

Our first instinct might be to drop every observation that contains a missing value, but this leaves us with too few rows to work with (4 of the original 469):

julia> Impute.drop(df)
4×6 DataFrames.DataFrame
│ Row │ V1      │ V2      │ V3      │ V4      │ V5      │ V6      │
│     │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ -247.0  │ -132.2  │ -18.8   │ 28.2    │ 81.4    │ 237.9   │
│ 2   │ -234.0  │ -140.8  │ -56.5   │ 28.0    │ 114.3   │ 222.9   │
│ 3   │ -215.8  │ -114.8  │ -18.4   │ 65.3    │ 171.6   │ 249.7   │
│ 4   │ -247.6  │ -157.8  │ -53.3   │ 28.3    │ 122.9   │ 227.6   │
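
To make the row-dropping idea concrete, here is a minimal sketch in plain Julia that keeps only the rows of a matrix with no missing entries. The helper name `drop_incomplete_rows` and the small matrix are made up for illustration; this is not how Impute.jl implements `Impute.drop`.

```julia
# Hypothetical helper (not part of Impute.jl): keep only complete rows.
function drop_incomplete_rows(X::AbstractMatrix)
    # One Bool per row: true when the row has no missing entry.
    keep = [!any(ismissing, row) for row in eachrow(X)]
    return X[keep, :]
end

# Made-up data: rows 1 and 3 each contain a missing value.
X = [1.0 missing; 2.0 3.0; missing 4.0; 5.0 6.0]
complete = drop_incomplete_rows(X)
println(size(complete))
```

A single missing cell is enough to discard a whole row, which is why so few rows survive when missing values are spread across several columns.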

We could try imputing the values with linear interpolation, but that still leaves missing data at the head and tail of our dataset:

julia> Impute.interp(df)
469×6 DataFrames.DataFrame
│ Row │ V1       │ V2       │ V3      │ V4       │ V5       │ V6       │
│     │ Float64⍰ │ Float64⍰ │ Float64 │ Float64⍰ │ Float64⍰ │ Float64⍰ │
├─────┼──────────┼──────────┼─────────┼──────────┼──────────┼──────────┤
│ 1   │ missing  │ -203.7   │ -84.1   │ 18.5     │ missing  │ missing  │
│ 2   │ missing  │ -203.0   │ -97.8   │ 25.8     │ 134.7    │ missing  │
│ 3   │ missing  │ -249.0   │ -92.1   │ 27.8     │ 177.1    │ missing  │
│ 4   │ missing  │ -231.5   │ -97.5   │ 27.0     │ 150.3    │ missing  │
│ 5   │ missing  │ -227.3   │ -130.1  │ 25.8     │ 160.0    │ missing  │
│ 6   │ missing  │ -223.1   │ -70.7   │ 62.1     │ 197.5    │ missing  │
│ 7   │ missing  │ -164.8   │ -12.2   │ 76.8     │ 202.8    │ missing  │
⋮
│ 462 │ -241.025 │ -207.3   │ -88.3   │ 9.6      │ 104.1    │ 218.0    │
│ 463 │ -242.6   │ -142.0   │ -21.8   │ 69.8     │ 148.7    │ 224.125  │
│ 464 │ -235.9   │ -128.8   │ -33.1   │ 68.8     │ 177.1    │ 230.25   │
│ 465 │ -239.8   │ -140.8   │ -38.7   │ 58.1     │ 186.3    │ 236.375  │
│ 466 │ -243.7   │ -149.5   │ -40.3   │ 62.8     │ 139.7    │ 242.5    │
│ 467 │ -247.6   │ -157.8   │ -53.3   │ 28.3     │ 122.9    │ 227.6    │
│ 468 │ missing  │ -154.9   │ -50.8   │ 28.1     │ 119.9    │ 201.1    │
│ 469 │ missing  │ -180.7   │ -70.9   │ 33.7     │ 114.8    │ 222.5    │
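
The following plain-Julia sketch illustrates why linear interpolation cannot fill the head or tail: a gap is only filled when it has a known value on both sides. The function name `interp_sketch` is hypothetical and the code is not taken from Impute.jl; the sample values are borrowed from column V6, rows 462 to 466, with a missing entry added at each end.

```julia
# Hypothetical helper (not Impute.jl's implementation): fill interior gaps
# by drawing a straight line between the nearest known neighbours.
function interp_sketch(v::AbstractVector{<:Union{Missing, Real}})
    out = Vector{Union{Missing, Float64}}(undef, length(v))
    out .= v                              # work on a copy
    known = findall(!ismissing, v)        # indices of observed values
    for (a, b) in zip(known[1:end-1], known[2:end])
        b - a > 1 || continue             # adjacent values: no gap to fill
        step = (v[b] - v[a]) / (b - a)
        for i in (a + 1):(b - 1)
            out[i] = v[a] + step * (i - a)
        end
    end
    return out                            # leading/trailing missings remain
end

v6 = [missing, 218.0, missing, missing, missing, 242.5, missing]
filled = interp_sketch(v6)
println(filled)
```

The three interior gaps become 224.125, 230.25 and 236.375, matching the V6 values in the table above, while the first and last entries stay missing because they have no neighbour on one side.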

Finally, we can chain multiple simple methods together to give a complete dataset:

julia> Impute.interp(df) |> Impute.locf() |> Impute.nocb()
469×6 DataFrames.DataFrame
│ Row │ V1       │ V2       │ V3      │ V4       │ V5       │ V6       │
│     │ Float64⍰ │ Float64⍰ │ Float64 │ Float64⍰ │ Float64⍰ │ Float64⍰ │
├─────┼──────────┼──────────┼─────────┼──────────┼──────────┼──────────┤
│ 1   │ -233.6   │ -203.7   │ -84.1   │ 18.5     │ 134.7    │ 222.7    │
│ 2   │ -233.6   │ -203.0   │ -97.8   │ 25.8     │ 134.7    │ 222.7    │
│ 3   │ -233.6   │ -249.0   │ -92.1   │ 27.8     │ 177.1    │ 222.7    │
│ 4   │ -233.6   │ -231.5   │ -97.5   │ 27.0     │ 150.3    │ 222.7    │
│ 5   │ -233.6   │ -227.3   │ -130.1  │ 25.8     │ 160.0    │ 222.7    │
│ 6   │ -233.6   │ -223.1   │ -70.7   │ 62.1     │ 197.5    │ 222.7    │
│ 7   │ -233.6   │ -164.8   │ -12.2   │ 76.8     │ 202.8    │ 222.7    │
⋮
│ 462 │ -241.025 │ -207.3   │ -88.3   │ 9.6      │ 104.1    │ 218.0    │
│ 463 │ -242.6   │ -142.0   │ -21.8   │ 69.8     │ 148.7    │ 224.125  │
│ 464 │ -235.9   │ -128.8   │ -33.1   │ 68.8     │ 177.1    │ 230.25   │
│ 465 │ -239.8   │ -140.8   │ -38.7   │ 58.1     │ 186.3    │ 236.375  │
│ 466 │ -243.7   │ -149.5   │ -40.3   │ 62.8     │ 139.7    │ 242.5    │
│ 467 │ -247.6   │ -157.8   │ -53.3   │ 28.3     │ 122.9    │ 227.6    │
│ 468 │ -247.6   │ -154.9   │ -50.8   │ 28.1     │ 119.9    │ 201.1    │
│ 469 │ -247.6   │ -180.7   │ -70.9   │ 33.7     │ 114.8    │ 222.5    │
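
For readers unfamiliar with the two fill methods in the chain, here is a rough plain-Julia sketch of last observation carried forward (LOCF) and next observation carried backward (NOCB). The names `locf_sketch` and `nocb_sketch` are hypothetical, the data is made up, and this is not Impute.jl's own code.

```julia
# Hypothetical helper: carry the last observed value forward into gaps.
function locf_sketch(v::AbstractVector)
    out = copy(v)
    for i in 2:length(out)
        if ismissing(out[i])
            out[i] = out[i - 1]   # stays missing if nothing was observed yet
        end
    end
    return out
end

# NOCB is LOCF applied to the reversed vector, then reversed back.
nocb_sketch(v::AbstractVector) = reverse(locf_sketch(reverse(v)))

v = [missing, 1.0, missing, 3.0, missing]
forward = locf_sketch(v)               # leading missing survives
complete = v |> locf_sketch |> nocb_sketch
println(forward)
println(complete)
```

LOCF alone cannot fill a leading gap and NOCB alone cannot fill a trailing one, so applying both in sequence leaves no missing values as long as the vector has at least one observation.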

Warning:

  • Your approach should depend on the properties of your data, e.g., whether values are missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR).
  • In-place calls aren't guaranteed to mutate the original data, but they will try to avoid copying where possible. In the future, it may be possible to detect whether in-place operations are permitted on an array or table using traits:


License:Other

