write_stata drops negative value labels
cschwem2er opened this issue · comments
Hi,
writing a data frame to a Stata file replaces negative values that have value labels:
library(sjmisc)
library(tidyverse)
library(sjlabelled)
data(efc)
efc <- efc %>%
select(c160age, e17age) %>%
rec(rec = "15:30=-99 [young]; 31:55=-98 [middle]; 56:max=3 [old]")
frq(efc$c160age_r)
# carer' age (x) <numeric>
# total N=908 valid N=901 mean=-51.98 sd=50.38
val label frq raw.prc valid.prc cum.prc
-99 young 48 5.29 5.33 5.33
-98 middle 442 48.68 49.06 54.38
3 old 411 45.26 45.62 100.00
NA NA 7 0.77 NA NA
tmp <- tempdir()
write_stata(efc, paste0(tmp,"/df.dta"))
efc2 <- read_stata(paste0(tmp,"/df.dta"))
frq(efc2$c160age_r)
# carer' age (x) <numeric>
# total N=908 valid N=901 mean=2.40 sd=0.59
val label frq raw.prc valid.prc cum.prc
1 young 48 5.29 5.33 5.33
2 middle 442 48.68 49.06 54.38
3 old 411 45.26 45.62 100.00
NA NA 7 0.77 NA NA
I'm not sure whether this is related to tidyverse/haven#367, because for that issue the Stata file seems to be correct. I checked the file produced with write_stata(), and it already contains the incorrect labels.
sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2 sjstats_0.14.2-4 sjlabelled_1.0.10 sjmisc_2.7.1.9000
[5] sjPlot_2.4.1.9000 forcats_0.3.0 stringr_1.3.0 dplyr_0.7.4
[9] purrr_0.2.4 readr_1.1.1 tidyr_0.8.0 tibble_1.4.2
[13] ggplot2_2.2.1.9000 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] TH.data_1.0-8 minqa_1.2.4 colorspace_1.3-2 modeltools_0.2-21
[5] ggridges_0.5.0 rsconnect_0.8.5 rprojroot_1.3-2 htmlTable_1.11.2
[9] estimability_1.3 snakecase_0.9.1 base64enc_0.1-3 rstudioapi_0.7
[13] glmmTMB_0.2.0 mvtnorm_1.0-7 lubridate_1.7.2 coin_1.2-2
[17] xml2_1.2.0 codetools_0.2-15 splines_3.4.4 mnormt_1.5-5
[21] knitr_1.20.2 effects_4.0-1 bayesplot_1.5.0 Formula_1.2-2
[25] jsonlite_1.5 nloptr_1.0.4 ggeffects_0.3.2 broom_0.4.4
[29] cluster_2.0.6 compiler_3.4.4 httr_1.3.1 emmeans_1.1.3
[33] backports_1.1.2 assertthat_0.2.0 Matrix_1.2-12 lazyeval_0.2.1
[37] survey_3.33-2 cli_1.0.0 acepack_1.4.1 htmltools_0.3.6
[41] tools_3.4.4 coda_0.19-1 gtable_0.2.0 glue_1.2.0
[45] reshape2_1.4.3 Rcpp_0.12.16 carData_3.0-1 cellranger_1.1.0
[49] nlme_3.1-131 psych_1.8.3.3 lmtest_0.9-36 lme4_1.1-17
[53] rvest_0.3.2 devtools_1.13.5 stringdist_0.9.4.7 MASS_7.3-48
[57] zoo_1.8-1 scales_0.5.0.9000 hms_0.4.2 parallel_3.4.4
[61] sandwich_2.4-0 pwr_1.2-2 TMB_1.7.12 RColorBrewer_1.1-2
[65] yaml_2.1.16 curl_3.1 gridExtra_2.3 memoise_1.1.0
[69] rpart_4.1-12 latticeExtra_0.6-28 stringi_1.1.7 highr_0.6
[73] checkmate_1.8.5 rlang_0.2.0.9001 pkgconfig_2.0.1 arm_1.10-1
[77] evaluate_0.10.1 lattice_0.20-35 prediction_0.3.2 bindr_0.1
[81] htmlwidgets_1.0 tidyselect_0.2.4 plyr_1.8.4 magrittr_1.5
[85] R6_2.2.2 Hmisc_4.1-1 multcomp_1.4-8 pillar_1.2.2
[89] haven_1.1.0 foreign_0.8-69 withr_2.1.2 survival_2.41-3
[93] abind_1.4-5 nnet_7.3-12 modelr_0.1.1 crayon_1.3.4
[97] rmarkdown_1.8 grid_3.4.4 readxl_1.0.0 data.table_1.10.4-3
[101] git2r_0.21.0 digest_0.6.15 xtable_1.8-2 stats4_3.4.4
[105] munsell_0.4.3
Could you check whether tidy_labels() or as_label(..., drop.na = TRUE) already drop negative value labels?
No dropping for tidy_labels():
tidy_labels(efc$c160age_r)[1:50]
[1] 3 -98 3 3 -98 3 3 3 3 -98 3 -98 3 3 -98 3 3 3 -98 -98 3 3
[23] -98 3 -98 3 3 3 3 3 3 3 3 3 3 3 3 3 3 -98 -98 3 3 3
[45] 3 3 3 3 3 3
as_label() does not show values, only labels (but it includes all of them: young, middle and old):
as_label(efc$c160age_r, drop.na = TRUE)[1:50]
[1] old middle old old middle old old old old middle old middle
[13] old old middle old old old middle middle old old middle old
[25] middle old old old old old old old old old old old
[37] old old old middle middle old old old old old old old
[49] old old
Levels: young middle old
Does haven::write_dta() have the same issue, i.e. does it also fail to write negative labels or values?
I think the problem is that I need to convert the labelled (numeric) variables into factors. Factors, however, can't have negative values:
factor(c(-99, -98, 3), levels = c("young", "middle", "aged"))
#> [1] <NA> <NA> <NA>
#> Levels: young middle aged
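As a minimal base-R sketch of why a factor conversion would produce the 1/2/3 values seen after reading the file back: a factor keeps the labels as levels but stores only positive integer codes, so the original values are not carried along. The values and labels below are taken from the recode in this issue; that write_stata() converts in exactly this way is an assumption, not something checked against its source.

```r
# Values and labels from the rec() call in this issue (illustrative only)
x <- c(-99, -98, 3, NA)

# Map each value to its label, roughly what a labelled-to-factor
# conversion would do (assumption about the conversion step)
f <- factor(x, levels = c(-99, -98, 3), labels = c("young", "middle", "old"))

# A factor stores integer codes 1..n, not the original values
codes <- as.integer(f)
codes
#> [1]  1  2  3 NA

# The labels survive as levels, but -99 and -98 are gone
levels(f)
#> [1] "young"  "middle" "old"
```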
If I do not convert variables into factors before writing, value labels are lost (at least this was true for haven until version 1.0.0 or so). So, could you check whether you still face this issue if you write your data with haven::write_dta()? If it works, I could remove the conversion to factors. If not, you can't write negative values.
I definitely can't just use haven::write_dta() on an sj dataset, e.g. efc, without losing the value labels.
library(sjmisc)
library(tidyverse)
library(sjlabelled)
data(efc)
efc <- efc %>%
select(c160age, e17age) %>%
rec(rec = "15:30=-99 [young]; 31:55=-98 [middle]; 56:max=3 [old]")
frq(efc$c160age_r)
# carer' age (x) <numeric>
# total N=908 valid N=901 mean=-51.98 sd=50.38
val label frq raw.prc valid.prc cum.prc
-99 young 48 5.29 5.33 5.33
-98 middle 442 48.68 49.06 54.38
3 old 411 45.26 45.62 100.00
NA NA 7 0.77 NA NA
tmp <- tempdir()
haven::write_dta(efc, paste0(tmp,"/df.dta"))
efc2 <- read_stata(paste0(tmp,"/df.dta"))
frq(efc2$c160age_r)
# carer' age (x) <numeric>
# total N=908 valid N=901 mean=-51.98 sd=50.38
val frq raw.prc valid.prc cum.prc
-99 48 5.29 5.33 5.33
-98 442 48.68 49.06 54.38
3 411 45.26 45.62 100.00
<NA> 7 0.77 NA NA
haven::print_labels(efc2$c160age_r)
Error: x must be a labelled vector
This is kind of expected right? Because sj packages and haven do not use value labels in the same way. But I'm not sure whether this is what you wanted me to try.
I just had another idea: maybe there is some "hacky" way round this problem, e.g. something like
- converting numerics to strings
- replacing negative sign with some unique string, e.g. (<>)
- to factor
- write
Obviously, this solution would be far from perfect, but still better than negative values disappearing completely. Also, when reading in the data afterwards, it would be possible to
- detect unique strings for negative values
- replace them with negative signs
- convert back to numeric
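The two lists above could be sketched roughly like this in base R. The helper names encode_negative() and decode_negative() are hypothetical and not part of sjlabelled or haven, and the marker "<>" is just the placeholder suggested above. The sketch only covers the round trip of the values themselves, not the value labels.

```r
# Hypothetical helper: numeric -> string -> factor, with the leading
# minus sign of negative values replaced by a marker string
encode_negative <- function(x, marker = "<>") {
  s <- as.character(x)
  neg <- !is.na(x) & x < 0
  # drop the leading "-" and prepend the marker instead
  s[neg] <- paste0(marker, substring(s[neg], 2))
  factor(s)
}

# Hypothetical helper: factor -> string -> numeric, turning the
# marker back into a minus sign
decode_negative <- function(f, marker = "<>") {
  s <- as.character(f)
  s <- sub(marker, "-", s, fixed = TRUE)
  as.numeric(s)
}

x <- c(-99, -98, 3, NA)
f <- encode_negative(x)   # this is what would be written to the file
decode_negative(f)        # after reading the file back
#> [1] -99 -98   3  NA
```

The marker has to be a string that cannot occur in a regular numeric value, otherwise decoding would be ambiguous.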
But I'm not sure whether this is what you wanted me to try.
Yes, exactly. Could you try that again, but convert to labelled-class before?
# need to remove these two, because labels are not of same type as vector values
efc <- efc %>% select(-n4pstu, -nur_pst) %>% map(as_labelled) %>% as_tibble()
tmp <- tempdir()
haven::write_dta(efc, paste0(tmp,"/df.dta"))
I guess that this will drop variable labels... That's why I convert labelled vectors to factors with variable names as labels.
looks alright to me:
library(sjmisc)
library(tidyverse)
library(sjlabelled)
data(efc)
efc <- efc %>%
select(c160age, e17age) %>%
rec(rec = "15:30=-99 [young]; 31:55=-98 [middle]; 56:max=3 [old]") %>%
add_columns(efc)
efc <- efc %>% select(-n4pstu, -nur_pst) %>% map(as_labelled) %>% as_tibble()
tmp <- tempdir()
haven::write_dta(efc, paste0(tmp,"/df.dta"))
efc2 <- read_stata(paste0(tmp,"/df.dta"))
frq(efc2$c160age_r)
# carer' age (x) <numeric>
# total N=908 valid N=901 mean=-51.98 sd=50.38
val label frq raw.prc valid.prc cum.prc
-99 young 48 5.29 5.33 5.33
-98 middle 442 48.68 49.06 54.38
3 old 411 45.26 45.62 100.00
NA NA 7 0.77 NA NA
sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] sjlabelled_1.0.10 forcats_0.3.0 stringr_1.3.0 dplyr_0.7.4 purrr_0.2.4 readr_1.1.1
[7] tidyr_0.8.0 tibble_1.4.2 ggplot2_2.2.1.9000 tidyverse_1.2.1 sjmisc_2.7.1.9000
loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 cellranger_1.1.0 pillar_1.2.2 compiler_3.4.4 plyr_1.8.4 bindr_0.1
[7] tools_3.4.4 lubridate_1.7.2 jsonlite_1.5 gtable_0.2.0 nlme_3.1-131 lattice_0.20-35
[13] pkgconfig_2.0.1 rlang_0.2.0.9001 psych_1.8.3.3 cli_1.0.0 rstudioapi_0.7 yaml_2.1.18
[19] parallel_3.4.4 haven_1.1.0 bindrcpp_0.2 withr_2.1.2 xml2_1.2.0 httr_1.3.1
[25] knitr_1.20.2 hms_0.4.2 grid_3.4.4 tidyselect_0.2.4 glue_1.2.0 snakecase_0.9.1
[31] data.table_1.10.4-3 R6_2.2.2 readxl_1.0.0 foreign_0.8-69 prediction_0.3.2 modelr_0.1.1
[37] reshape2_1.4.3 magrittr_1.5 scales_0.5.0.9000 rvest_0.3.2 stringdist_0.9.4.7 assertthat_0.2.0
[43] mnormt_1.5-5 colorspace_1.3-2 stringi_1.1.7 lazyeval_0.2.1 munsell_0.4.3 broom_0.4.4
[49] crayon_1.3.4
ok, seems that haven has fixed the issues with variable names, at least for Stata. I must check if this also applies to SPSS.
Btw, you don't need select() and add_columns() in your example, as rec() automatically selects and appends:
efc <- efc %>%
rec(c160age, e17age, rec = "15:30=-99 [young]; 31:55=-98 [middle]; 56:max=3 [old]")
:-)
Oh cool, that's useful :)
Just saw that haven behaves differently in version 1.1.1, because for me, your code does not work as intended.
Hmm, maybe it would be worth waiting until Hadley responds to some of the issues related to 1.1.1? It seems as if some bugs got introduced in this version.
seems like Hadley is working on haven right now, you may want to check if this issue still exists (using the current dev-versions of sjlabelled and haven).
Any news on this issue?
sorry I'm currently at a workshop and can't really dig into that. I'll try to check in July.
bump
Problem solved!
> data(efc)
>
> efc <- efc %>%
+ select(c160age, e17age) %>%
+ rec(rec = "15:30=-99 [young]; 31:55=-98 [middle]; 56:max=3 [old]")
>
> frq(efc$c160age_r)
# carer' age (x) <numeric>
# total N=908 valid N=901 mean=-51.98 sd=50.38
val label frq raw.prc valid.prc cum.prc
-99 young 48 5.29 5.33 5.33
-98 middle 442 48.68 49.06 54.38
3 old 411 45.26 45.62 100.00
NA NA 7 0.77 NA NA
>
>
>
> tmp <- tempdir()
> write_stata(efc, paste0(tmp,"/df.dta"))
Tidying value labels. Please wait...
Writing stata file to 'C:\Users\kasus\AppData\Local\Temp\RtmpsvxSxb/df.dta'. Please wait...
> efc2 <- read_stata(paste0(tmp,"/df.dta"))
| | 0%Converting labelled-classes. Please wait...
|======================================================================| 100%
>
> frq(efc2$c160age_r)
# carer' age (x) <numeric>
# total N=908 valid N=901 mean=-51.98 sd=50.38
val label frq raw.prc valid.prc cum.prc
-99 young 48 5.29 5.33 5.33
-98 middle 442 48.68 49.06 54.38
3 old 411 45.26 45.62 100.00
NA NA 7 0.77 NA NA