Large tibble with very few non-missings shows 100% missing in miss_var_summary()
jzadra opened this issue · comments
I've been puzzling over why I was seeing a column with 100 percent missing data in a large tibble (~30,000 rows) made up of several tibbles combined with bind_rows()
despite seeing that one of the individual tibbles does not show 100% missing for that column.
After a bunch of wild goose chases, I realized that the issue was that miss_var_summary()
(and probably similar naniar functions) was rounding the pct_miss
column up.
library(tidyverse)
library(naniar)
df <- tibble(x = rep(NA_real_, 30000)) %>%
add_row(x = 0)
df %>% miss_var_summary()
#> # A tibble: 1 x 3
#> variable n_miss pct_miss
#> <chr> <int> <dbl>
#> 1 x 30000 100.
df %>% filter(!is.na(x))
#> # A tibble: 1 x 1
#> x
#> <dbl>
#> 1 0
In this example, the percent missing is actually 99.9967%.
All that to say, I'd like to suggest that for the edge cases of near-zero and near-100 percent missings not be rounded to avoid this confusion.
Thanks for the reprex! I can reproduce this.
This is actually an issue with how tibble
prints - if we coerce to a data.frame
we get the value back out.
library(tidyverse)
library(naniar)
df <- tibble(x = rep(NA_real_, 30000)) %>%
add_row(x = 0)
df %>% miss_var_summary()
#> # A tibble: 1 x 3
#> variable n_miss pct_miss
#> <chr> <int> <dbl>
#> 1 x 30000 100.
df %>% miss_var_summary() %>% as.data.frame()
#> variable n_miss pct_miss
#> 1 x 30000 99.99667
Created on 2021-05-13 by the reprex package (v2.0.0)
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.5 (2021-03-31)
#> os macOS Big Sur 10.16
#> system x86_64, darwin17.0
#> ui X11
#> language (EN)
#> collate en_AU.UTF-8
#> ctype en_AU.UTF-8
#> tz Australia/Perth
#> date 2021-05-13
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] standard (@0.2.1)
#> backports 1.2.1 2020-12-09 [1] standard (@1.2.1)
#> broom 0.7.5 2021-02-19 [1] CRAN (R 4.0.2)
#> cellranger 1.1.0 2016-07-27 [1] standard (@1.1.0)
#> cli 2.5.0 2021-04-26 [1] CRAN (R 4.0.2)
#> colorspace 2.0-0 2020-11-11 [1] standard (@2.0-0)
#> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.2)
#> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.2)
#> dbplyr 2.1.0 2021-02-03 [1] CRAN (R 4.0.2)
#> digest 0.6.27 2020-10-24 [1] standard (@0.6.27)
#> dplyr * 1.0.6 2021-05-05 [1] CRAN (R 4.0.2)
#> ellipsis 0.3.1 2020-05-15 [1] standard (@0.3.1)
#> evaluate 0.14 2019-05-28 [1] standard (@0.14)
#> fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.2)
#> forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.0.2)
#> fs 1.5.0 2020-07-31 [1] standard (@1.5.0)
#> generics 0.1.0 2020-10-31 [1] standard (@0.1.0)
#> ggplot2 * 3.3.3 2020-12-30 [1] CRAN (R 4.0.2)
#> glue 1.4.2 2020-08-27 [1] standard (@1.4.2)
#> gtable 0.3.0 2019-03-25 [1] standard (@0.3.0)
#> haven 2.3.1 2020-06-01 [1] standard (@2.3.1)
#> highr 0.8 2019-03-20 [1] standard (@0.8)
#> hms 1.0.0 2021-01-13 [1] CRAN (R 4.0.2)
#> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
#> httr 1.4.2 2020-07-20 [1] standard (@1.4.2)
#> jsonlite 1.7.2 2020-12-09 [1] standard (@1.7.2)
#> knitr 1.31 2021-01-27 [1] CRAN (R 4.0.2)
#> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.2)
#> lubridate 1.7.10 2021-02-26 [1] CRAN (R 4.0.2)
#> magrittr 2.0.1 2020-11-17 [1] standard (@2.0.1)
#> modelr 0.1.8 2020-05-19 [1] standard (@0.1.8)
#> munsell 0.5.0 2018-06-12 [1] standard (@0.5.0)
#> naniar * 0.6.0.9000 2020-12-23 [1] local
#> pillar 1.6.0 2021-04-13 [1] CRAN (R 4.0.2)
#> pkgconfig 2.0.3 2019-09-22 [1] standard (@2.0.3)
#> purrr * 0.3.4 2020-04-17 [1] standard (@0.3.4)
#> R6 2.5.0 2020-10-28 [1] standard (@2.5.0)
#> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.2)
#> readr * 1.4.0 2020-10-05 [1] standard (@1.4.0)
#> readxl 1.3.1 2019-03-13 [1] standard (@1.3.1)
#> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.0.2)
#> rlang 0.4.11 2021-04-30 [1] CRAN (R 4.0.2)
#> rmarkdown 2.7 2021-02-19 [1] CRAN (R 4.0.2)
#> rstudioapi 0.13 2020-11-12 [1] standard (@0.13)
#> rvest 1.0.0 2021-03-09 [1] CRAN (R 4.0.2)
#> scales 1.1.1 2020-05-11 [1] standard (@1.1.1)
#> sessioninfo 1.1.1 2018-11-05 [1] standard (@1.1.1)
#> stringi 1.5.3 2020-09-09 [1] standard (@1.5.3)
#> stringr * 1.4.0 2019-02-10 [1] standard (@1.4.0)
#> styler 1.4.1 2021-03-30 [1] CRAN (R 4.0.2)
#> tibble * 3.1.1 2021-04-18 [1] CRAN (R 4.0.3)
#> tidyr * 1.1.3 2021-03-03 [1] CRAN (R 4.0.2)
#> tidyselect 1.1.0 2020-05-11 [1] standard (@1.1.0)
#> tidyverse * 1.3.0 2019-11-21 [1] standard (@1.3.0)
#> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.2)
#> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.0.2)
#> visdat 0.5.3 2019-02-15 [1] CRAN (R 4.0.2)
#> withr 2.4.2 2021-04-18 [1] CRAN (R 4.0.3)
#> xfun 0.22 2021-03-11 [1] CRAN (R 4.0.2)
#> xml2 1.3.2 2020-04-23 [1] standard (@1.3.2)
#> yaml 2.2.1 2020-02-01 [1] standard (@2.2.1)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
I'll cross post this to tibble, this isn't ideal behaviour, thanks for reporting.
library(tidyverse)
library(naniar)
N <- 30000000
df <- tibble(x = rep(NA_real_, N)) %>%
add_row(x = 0)
df %>% miss_var_summary()
#> # A tibble: 1 × 3
#> variable n_miss pct_miss
#> <chr> <int> <dbl>
#> 1 x 30000000 100.
df %>%
miss_var_summary() %>%
as.data.frame()
#> variable n_miss pct_miss
#> 1 x 30000000 100
df %>%
miss_var_summary() %>%
mutate(pct_miss = num(pct_miss, digits = trunc(log10(N) + 2)))
#> # A tibble: 1 × 3
#> variable n_miss pct_miss
#> <chr> <int> <num:.9!>
#> 1 x 30000000 99.999996667
df %>%
miss_var_summary() %>%
mutate(pct_miss = num(pct_miss, digits = trunc(log10(N) + 2))) %>%
as.data.frame()
#> variable n_miss pct_miss
#> 1 x 30000000 99.999996667
Created on 2021-08-01 by the reprex package (v2.0.0.9000)
pillar::num()
(reexported as tibble::num()
) allow specifying arbitrary digits or significant figures in this specific example.
You could also store pct_miss
as a value between 0 and 1 with num(scale = 100)
.
Oh that's awesome! Thanks so much, @krlmlr - I'll add this feature in the next release.
OK so here is the old way
library(tidyverse)
library(naniar)
N <- 30000000
df <- tibble(x = rep(NA_real_, N)) %>%
add_row(x = 0)
df %>% miss_var_summary()
#> # A tibble: 1 × 3
#> variable n_miss pct_miss
#> <chr> <int> <dbl>
#> 1 x 30000000 100.
df %>%
miss_var_summary() %>%
as.data.frame()
#> variable n_miss pct_miss
#> 1 x 30000000 100
Created on 2023-04-10 with reprex v2.0.2
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.3 (2023-03-15)
#> os macOS Ventura 13.2
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Australia/Hobart
#> date 2023-04-10
#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
#> backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
#> broom 1.0.3 2023-01-25 [1] CRAN (R 4.2.0)
#> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.2.0)
#> cli 3.6.0 2023-01-09 [1] CRAN (R 4.2.0)
#> colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.2.0)
#> crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.0)
#> DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0)
#> dbplyr 2.3.0 2023-01-16 [1] CRAN (R 4.2.0)
#> digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.0)
#> dplyr * 1.1.1 2023-03-22 [1] CRAN (R 4.2.0)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
#> evaluate 0.20 2023-01-17 [1] CRAN (R 4.2.0)
#> fansi 1.0.4 2023-01-22 [1] CRAN (R 4.2.0)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
#> forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.2.0)
#> fs 1.6.1 2023-02-06 [1] CRAN (R 4.2.0)
#> gargle 1.3.0 2023-01-30 [1] CRAN (R 4.2.0)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0)
#> ggplot2 * 3.4.1 2023-02-10 [1] CRAN (R 4.2.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#> googledrive 2.0.0 2021-07-08 [1] CRAN (R 4.2.0)
#> googlesheets4 1.0.1 2022-08-13 [1] CRAN (R 4.2.0)
#> gtable 0.3.1 2022-09-01 [1] CRAN (R 4.2.0)
#> haven 2.5.1 2022-08-22 [1] CRAN (R 4.2.0)
#> hms 1.1.2 2022-08-19 [1] CRAN (R 4.2.0)
#> htmltools 0.5.4 2022-12-07 [1] CRAN (R 4.2.0)
#> httr 1.4.4 2022-08-17 [1] CRAN (R 4.2.0)
#> jsonlite 1.8.4 2022-12-06 [1] CRAN (R 4.2.0)
#> knitr 1.42 2023-01-25 [1] CRAN (R 4.2.0)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
#> lubridate 1.9.1 2023-01-24 [1] CRAN (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#> modelr 0.1.10 2022-11-11 [1] CRAN (R 4.2.0)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
#> naniar * 1.0.0.9000 2023-04-10 [1] local
#> pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
#> purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.2.0)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.0)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0)
#> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.2.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
#> readr * 2.1.3 2022-10-01 [1] CRAN (R 4.2.0)
#> readxl 1.4.1 2022-08-17 [1] CRAN (R 4.2.0)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.0)
#> rlang 1.1.0 2023-03-14 [1] CRAN (R 4.2.0)
#> rmarkdown 2.20 2023-01-19 [1] CRAN (R 4.2.0)
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0)
#> rvest 1.0.3 2022-08-19 [1] CRAN (R 4.2.0)
#> scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
#> stringi 1.7.12 2023-01-11 [1] CRAN (R 4.2.0)
#> stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.2.0)
#> styler 1.9.0 2023-01-15 [1] CRAN (R 4.2.0)
#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.2.0)
#> tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.2.0)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.0)
#> tidyverse * 1.3.2 2022-07-18 [1] CRAN (R 4.2.0)
#> timechange 0.2.0 2023-01-11 [1] CRAN (R 4.2.0)
#> tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.0)
#> utf8 1.2.3 2023-01-31 [1] CRAN (R 4.2.0)
#> vctrs 0.6.1 2023-03-22 [1] CRAN (R 4.2.0)
#> visdat 0.6.0 2023-02-02 [1] local
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#> xfun 0.37 2023-01-31 [1] CRAN (R 4.2.0)
#> xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.0)
#> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.2.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
And the new way
library(tidyverse)
library(naniar)
N <- 30000000
df <- tibble(x = rep(NA_real_, N)) %>%
add_row(x = 0)
df %>% miss_var_summary()
#> # A tibble: 1 × 3
#> variable n_miss pct_miss
#> <chr> <int> <num>
#> 1 x 30000000 100.
df %>% miss_var_summary(digits = 6)
#> # A tibble: 1 × 3
#> variable n_miss pct_miss
#> <chr> <int> <num:.6!>
#> 1 x 30000000 99.999997
Created on 2023-04-10 with reprex v2.0.2
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.3 (2023-03-15)
#> os macOS Ventura 13.2
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Australia/Hobart
#> date 2023-04-10
#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
#> backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
#> broom 1.0.3 2023-01-25 [1] CRAN (R 4.2.0)
#> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.2.0)
#> cli 3.6.0 2023-01-09 [1] CRAN (R 4.2.0)
#> colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.2.0)
#> crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.0)
#> DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0)
#> dbplyr 2.3.0 2023-01-16 [1] CRAN (R 4.2.0)
#> digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.0)
#> dplyr * 1.1.1 2023-03-22 [1] CRAN (R 4.2.0)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
#> evaluate 0.20 2023-01-17 [1] CRAN (R 4.2.0)
#> fansi 1.0.4 2023-01-22 [1] CRAN (R 4.2.0)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
#> forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.2.0)
#> fs 1.6.1 2023-02-06 [1] CRAN (R 4.2.0)
#> gargle 1.3.0 2023-01-30 [1] CRAN (R 4.2.0)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0)
#> ggplot2 * 3.4.1 2023-02-10 [1] CRAN (R 4.2.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#> googledrive 2.0.0 2021-07-08 [1] CRAN (R 4.2.0)
#> googlesheets4 1.0.1 2022-08-13 [1] CRAN (R 4.2.0)
#> gtable 0.3.1 2022-09-01 [1] CRAN (R 4.2.0)
#> haven 2.5.1 2022-08-22 [1] CRAN (R 4.2.0)
#> hms 1.1.2 2022-08-19 [1] CRAN (R 4.2.0)
#> htmltools 0.5.4 2022-12-07 [1] CRAN (R 4.2.0)
#> httr 1.4.4 2022-08-17 [1] CRAN (R 4.2.0)
#> jsonlite 1.8.4 2022-12-06 [1] CRAN (R 4.2.0)
#> knitr 1.42 2023-01-25 [1] CRAN (R 4.2.0)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
#> lubridate 1.9.1 2023-01-24 [1] CRAN (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#> modelr 0.1.10 2022-11-11 [1] CRAN (R 4.2.0)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
#> naniar * 1.0.0.9000 2023-04-10 [1] local
#> pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
#> purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.2.0)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.0)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0)
#> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.2.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
#> readr * 2.1.3 2022-10-01 [1] CRAN (R 4.2.0)
#> readxl 1.4.1 2022-08-17 [1] CRAN (R 4.2.0)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.0)
#> rlang 1.1.0 2023-03-14 [1] CRAN (R 4.2.0)
#> rmarkdown 2.20 2023-01-19 [1] CRAN (R 4.2.0)
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0)
#> rvest 1.0.3 2022-08-19 [1] CRAN (R 4.2.0)
#> scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
#> stringi 1.7.12 2023-01-11 [1] CRAN (R 4.2.0)
#> stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.2.0)
#> styler 1.9.0 2023-01-15 [1] CRAN (R 4.2.0)
#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.2.0)
#> tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.2.0)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.0)
#> tidyverse * 1.3.2 2022-07-18 [1] CRAN (R 4.2.0)
#> timechange 0.2.0 2023-01-11 [1] CRAN (R 4.2.0)
#> tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.0)
#> utf8 1.2.3 2023-01-31 [1] CRAN (R 4.2.0)
#> vctrs 0.6.1 2023-03-22 [1] CRAN (R 4.2.0)
#> visdat 0.6.0 2023-02-02 [1] local
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#> xfun 0.37 2023-01-31 [1] CRAN (R 4.2.0)
#> xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.0)
#> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.2.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────