write_stata drops negative value labels

Question

cschwem2er opened this issue 6 years ago · comments

Hi,

Writing a data frame to a Stata file with write_stata() replaces the negative values of labelled variables (below, -99 and -98 come back as 1 and 2):

library(sjmisc)
library(tidyverse)
library(sjlabelled)
data(efc)

efc <- efc %>%
  select(c160age, e17age) %>%
  rec(rec = "15:30=-99 [young]; 31:55=-98 [middle]; 56:max=3 [old]")

frq(efc$c160age_r)

# carer' age (x) <numeric> 
# total N=908  valid N=901  mean=-51.98  sd=50.38
 
 val  label frq raw.prc valid.prc cum.prc
 -99  young  48    5.29      5.33    5.33
 -98 middle 442   48.68     49.06   54.38
   3    old 411   45.26     45.62  100.00
  NA     NA   7    0.77        NA      NA


tmp <- tempdir()
write_stata(efc, paste0(tmp,"/df.dta"))
efc2 <- read_stata(paste0(tmp,"/df.dta"))

frq(efc2$c160age_r)

# carer' age (x) <numeric> 
# total N=908  valid N=901  mean=2.40  sd=0.59
 
 val  label frq raw.prc valid.prc cum.prc
   1  young  48    5.29      5.33    5.33
   2 middle 442   48.68     49.06   54.38
   3    old 411   45.26     45.62  100.00
  NA     NA   7    0.77        NA      NA

I'm not sure whether this is related to tidyverse/haven#367, because for that issue, the Stata file seems to be correct. I checked the file produced with write_stata and it already contains the incorrect labels.

sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2       sjstats_0.14.2-4   sjlabelled_1.0.10  sjmisc_2.7.1.9000 
 [5] sjPlot_2.4.1.9000  forcats_0.3.0      stringr_1.3.0      dplyr_0.7.4       
 [9] purrr_0.2.4        readr_1.1.1        tidyr_0.8.0        tibble_1.4.2      
[13] ggplot2_2.2.1.9000 tidyverse_1.2.1   

loaded via a namespace (and not attached):
  [1] TH.data_1.0-8       minqa_1.2.4         colorspace_1.3-2    modeltools_0.2-21  
  [5] ggridges_0.5.0      rsconnect_0.8.5     rprojroot_1.3-2     htmlTable_1.11.2   
  [9] estimability_1.3    snakecase_0.9.1     base64enc_0.1-3     rstudioapi_0.7     
 [13] glmmTMB_0.2.0       mvtnorm_1.0-7       lubridate_1.7.2     coin_1.2-2         
 [17] xml2_1.2.0          codetools_0.2-15    splines_3.4.4       mnormt_1.5-5       
 [21] knitr_1.20.2        effects_4.0-1       bayesplot_1.5.0     Formula_1.2-2      
 [25] jsonlite_1.5        nloptr_1.0.4        ggeffects_0.3.2     broom_0.4.4        
 [29] cluster_2.0.6       compiler_3.4.4      httr_1.3.1          emmeans_1.1.3      
 [33] backports_1.1.2     assertthat_0.2.0    Matrix_1.2-12       lazyeval_0.2.1     
 [37] survey_3.33-2       cli_1.0.0           acepack_1.4.1       htmltools_0.3.6    
 [41] tools_3.4.4         coda_0.19-1         gtable_0.2.0        glue_1.2.0         
 [45] reshape2_1.4.3      Rcpp_0.12.16        carData_3.0-1       cellranger_1.1.0   
 [49] nlme_3.1-131        psych_1.8.3.3       lmtest_0.9-36       lme4_1.1-17        
 [53] rvest_0.3.2         devtools_1.13.5     stringdist_0.9.4.7  MASS_7.3-48        
 [57] zoo_1.8-1           scales_0.5.0.9000   hms_0.4.2           parallel_3.4.4     
 [61] sandwich_2.4-0      pwr_1.2-2           TMB_1.7.12          RColorBrewer_1.1-2 
 [65] yaml_2.1.16         curl_3.1            gridExtra_2.3       memoise_1.1.0      
 [69] rpart_4.1-12        latticeExtra_0.6-28 stringi_1.1.7       highr_0.6          
 [73] checkmate_1.8.5     rlang_0.2.0.9001    pkgconfig_2.0.1     arm_1.10-1         
 [77] evaluate_0.10.1     lattice_0.20-35     prediction_0.3.2    bindr_0.1          
 [81] htmlwidgets_1.0     tidyselect_0.2.4    plyr_1.8.4          magrittr_1.5       
 [85] R6_2.2.2            Hmisc_4.1-1         multcomp_1.4-8      pillar_1.2.2       
 [89] haven_1.1.0         foreign_0.8-69      withr_2.1.2         survival_2.41-3    
 [93] abind_1.4-5         nnet_7.3-12         modelr_0.1.1        crayon_1.3.4       
 [97] rmarkdown_1.8       grid_3.4.4          readxl_1.0.0        data.table_1.10.4-3
[101] git2r_0.21.0        digest_0.6.15       xtable_1.8-2        stats4_3.4.4       
[105] munsell_0.4.3

Daniel commented 6 years ago

bump

Daniel · Answer 1 · Sun Apr 29 2018 22:54:15 GMT+0800 (China Standard Time)

Could you check whether tidy_labels() or as_label(..., drop.na = TRUE) already drops the negative value labels?

Carsten Schwemmer · Answer 2 · Sun Apr 29 2018 23:01:28 GMT+0800 (China Standard Time)

no dropping for tidy_labels:

tidy_labels(efc$c160age_r)[1:50]
 [1]   3 -98   3   3 -98   3   3   3   3 -98   3 -98   3   3 -98   3   3   3 -98 -98   3   3
[23] -98   3 -98   3   3   3   3   3   3   3   3   3   3   3   3   3   3 -98 -98   3   3   3
[45]   3   3   3   3   3   3

as_label does not show values, but only labels (but it includes all of them, young middle and old):

as_label(efc$c160age_r, drop.na = TRUE)[1:50]

 [1] old    middle old    old    middle old    old    old    old    middle old    middle
[13] old    old    middle old    old    old    middle middle old    old    middle old   
[25] middle old    old    old    old    old    old    old    old    old    old    old   
[37] old    old    old    middle middle old    old    old    old    old    old    old   
[49] old    old   
Levels: young middle old

Daniel · Answer 3 · Sun Apr 29 2018 23:36:22 GMT+0800 (China Standard Time)

Does haven::write_dta() have the same issue, i.e. does it also fail to write negative values and their labels?

Daniel · Answer 4 · Mon Apr 30 2018 01:08:10 GMT+0800 (China Standard Time)

I think the problem is that I need to convert the labelled (numeric) variables into factors. Factors, however, can't have negative values:

factor(c(-99, -98, 3), levels = c("young", "middle", "aged"))
#> [1] <NA> <NA> <NA>
#> Levels: young middle aged

If I do not convert variables into factors before writing, value labels are lost (at least this was true for haven until version 1.0.0 or so). So, could you check whether you still face this issue when you write your data with haven::write_dta()? If it works, I could remove the conversion to factors. If not, you can't write negative values.
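
To illustrate the point about factors, here is a minimal base R sketch. The vector x and the value/label mapping are made up for illustration and only mirror the values used in this thread. A factor stores integer codes that index into its levels, so the original values -99 and -98 cannot survive the conversion:

```r
# Hypothetical labelled values mirroring the example in this thread
values <- c(-99, -98, 3)
labels <- c("young", "middle", "old")
x <- c(3, -98, -99, 3, NA)

# Map each value to its label; the factor keeps only positions in `levels`
f <- factor(x, levels = values, labels = labels)

# Integer codes are level positions, not the original values
codes <- as.integer(f)
print(levels(f))
print(codes)
```

If a writer stores these codes instead of the original values, -99 would end up as 1 and -98 as 2, which would be consistent with the frq() output after the round trip in the original report.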

Carsten Schwemmer · Answer 5 · Mon Apr 30 2018 05:30:28 GMT+0800 (China Standard Time)

I definitely can't just use haven::write_dta() on an sj dataset, e.g. efc, without losing the value labels.

library(sjmisc)
library(tidyverse)
library(sjlabelled)
data(efc)

efc <- efc %>%
  select(c160age, e17age) %>%
  rec(rec = "15:30=-99 [young]; 31:55=-98 [middle]; 56:max=3 [old]")

frq(efc$c160age_r)

# carer' age (x) <numeric> 
# total N=908  valid N=901  mean=-51.98  sd=50.38
 
 val  label frq raw.prc valid.prc cum.prc
 -99  young  48    5.29      5.33    5.33
 -98 middle 442   48.68     49.06   54.38
   3    old 411   45.26     45.62  100.00
  NA     NA   7    0.77        NA      NA


tmp <- tempdir()
haven::write_dta(efc, paste0(tmp,"/df.dta"))
efc2 <- read_stata(paste0(tmp,"/df.dta"))
frq(efc2$c160age_r)

# carer' age (x) <numeric> 
# total N=908  valid N=901  mean=-51.98  sd=50.38
 
  val frq raw.prc valid.prc cum.prc
  -99  48    5.29      5.33    5.33
  -98 442   48.68     49.06   54.38
    3 411   45.26     45.62  100.00
 <NA>   7    0.77        NA      NA

haven::print_labels(efc2$c160age_r)

Error: x must be a labelled vector

This is kind of expected, right? Because sj packages and haven do not use value labels in the same way. But I'm not sure whether this is what you wanted me to try.
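
A minimal sketch of that distinction, assuming haven is installed. The vectors are made up for illustration. A plain numeric vector that merely carries a "labels" attribute is not a labelled vector in haven's sense, which is one plausible reason for the print_labels() error above:

```r
library(haven)

# Plain numeric vector with a "labels" attribute (hypothetical data)
x <- c(-99, -98, 3)
attr(x, "labels") <- c(young = -99, middle = -98, old = 3)

# The same values wrapped in haven's labelled class
y <- labelled(c(-99, -98, 3), labels = c(young = -99, middle = -98, old = 3))

# is.labelled() checks the class, so the attribute alone is not enough
plain_is_labelled <- is.labelled(x)
class_is_labelled <- is.labelled(y)
print(c(plain = plain_is_labelled, labelled = class_is_labelled))
```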

Carsten Schwemmer · Answer 6 · Mon Apr 30 2018 05:31:44 GMT+0800 (China Standard Time)

I just had another idea: maybe there is some "hacky" way round this problem, e.g. something like

1. convert numerics to strings
2. replace the negative sign with some unique string, e.g. (<>)
3. convert to factor
4. write

Obviously, this solution would be far from perfect, but still better than negative values disappearing completely. Also, when reading in the data afterwards, it would be possible to

1. detect the unique strings for negative values
2. replace them with negative signs
3. convert back to numeric
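
The steps above could be sketched roughly as follows, using base R only. The helper names encode_values() and decode_values() and the marker string are hypothetical, and the sketch only round-trips the values; value labels would still have to be carried along separately:

```r
# Hypothetical placeholder that stands in for the minus sign
neg_marker <- "(<>)"

# Encode: numeric -> string, swap a leading "-" for the marker, then to factor
encode_values <- function(x) {
  s <- as.character(x)
  s <- sub("^-", neg_marker, s)
  factor(s)
}

# Decode: factor -> string, swap the marker back to "-", then to numeric
decode_values <- function(f) {
  s <- as.character(f)
  s <- sub(neg_marker, "-", s, fixed = TRUE)
  as.numeric(s)
}

x <- c(-99, -98, 3, NA, -99)
encoded <- encode_values(x)
decoded <- decode_values(encoded)
print(levels(encoded))
print(decoded)
```

The marker has to be a string that can never occur in real data, otherwise decoding would be ambiguous.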

Daniel · Answer 7 · Mon Apr 30 2018 14:12:55 GMT+0800 (China Standard Time)

> But I'm not sure whether this is what you wanted me to try.

Yes, exactly. Could you try that again, but convert to labelled-class before?

# need to remove these two, because labels are not of same type as vector values
efc <- efc %>% select(-n4pstu, -nur_pst) %>% map(as_labelled) %>% as_tibble()
tmp <- tempdir()
haven::write_dta(efc, paste0(tmp,"/df.dta"))

I guess that this will drop variable labels... That's why I convert labelled vectors to factors with variable names as labels.
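
For reference, a self-contained sketch of the same check that does not depend on the efc data, assuming haven is installed. The data frame, column names and labels are made up for illustration, and the values are stored as integers on the assumption that Stata value labels apply to integer values:

```r
library(haven)

# Hypothetical data with negative labelled values
df <- data.frame(id = 1:3)
df$age_r <- labelled(
  c(-99L, -98L, 3L),
  labels = c(young = -99L, middle = -98L, old = 3L),
  label = "carer age, recoded"
)

# Round trip through a temporary Stata file
path <- tempfile(fileext = ".dta")
write_dta(df, path)
back <- read_dta(path)

# Inspect the values and value labels that came back
vals <- as.numeric(unclass(back$age_r))
labs <- attr(back$age_r, "labels")
print(vals)
print(labs)
```

If the negative values and their labels come back unchanged here, the conversion to factors is probably not needed for Stata files with that haven version.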

Carsten Schwemmer · Answer 8 · Wed May 02 2018 18:07:48 GMT+0800 (China Standard Time)

looks alright to me:

library(sjmisc)
library(tidyverse)
library(sjlabelled)
data(efc)

efc <- efc %>%
  select(c160age, e17age) %>%
  rec(rec = "15:30=-99 [young]; 31:55=-98 [middle]; 56:max=3 [old]") %>% 
  add_columns(efc)

efc <- efc %>% select(-n4pstu, -nur_pst) %>% map(as_labelled) %>% as_tibble()
tmp <- tempdir()
haven::write_dta(efc, paste0(tmp,"/df.dta"))


efc2 <- read_stata(paste0(tmp,"/df.dta"))
frq(efc2$c160age_r)
# carer' age (x) <numeric> 
# total N=908  valid N=901  mean=-51.98  sd=50.38
 
 val  label frq raw.prc valid.prc cum.prc
 -99  young  48    5.29      5.33    5.33
 -98 middle 442   48.68     49.06   54.38
   3    old 411   45.26     45.62  100.00
  NA     NA   7    0.77        NA      NA


sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] sjlabelled_1.0.10  forcats_0.3.0      stringr_1.3.0      dplyr_0.7.4        purrr_0.2.4        readr_1.1.1       
 [7] tidyr_0.8.0        tibble_1.4.2       ggplot2_2.2.1.9000 tidyverse_1.2.1    sjmisc_2.7.1.9000 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16        cellranger_1.1.0    pillar_1.2.2        compiler_3.4.4      plyr_1.8.4          bindr_0.1          
 [7] tools_3.4.4         lubridate_1.7.2     jsonlite_1.5        gtable_0.2.0        nlme_3.1-131        lattice_0.20-35    
[13] pkgconfig_2.0.1     rlang_0.2.0.9001    psych_1.8.3.3       cli_1.0.0           rstudioapi_0.7      yaml_2.1.18        
[19] parallel_3.4.4      haven_1.1.0         bindrcpp_0.2        withr_2.1.2         xml2_1.2.0          httr_1.3.1         
[25] knitr_1.20.2        hms_0.4.2           grid_3.4.4          tidyselect_0.2.4    glue_1.2.0          snakecase_0.9.1    
[31] data.table_1.10.4-3 R6_2.2.2            readxl_1.0.0        foreign_0.8-69      prediction_0.3.2    modelr_0.1.1       
[37] reshape2_1.4.3      magrittr_1.5        scales_0.5.0.9000   rvest_0.3.2         stringdist_0.9.4.7  assertthat_0.2.0   
[43] mnormt_1.5-5        colorspace_1.3-2    stringi_1.1.7       lazyeval_0.2.1      munsell_0.4.3       broom_0.4.4        
[49] crayon_1.3.4

Daniel · Answer 9 · Wed May 02 2018 19:35:03 GMT+0800 (China Standard Time)

ok, seems that haven has fixed the issues with variable names, at least for Stata. I must check if this also applies to SPSS.

Daniel · Answer 10 · Wed May 02 2018 19:59:44 GMT+0800 (China Standard Time)

Btw, you don't need select() and add_columns() in your example, as rec() automatically selects and appends:

efc <- efc %>%
  rec(c160age, e17age, rec = "15:30=-99 [young]; 31:55=-98 [middle]; 56:max=3 [old]")

:-)

Carsten Schwemmer · Answer 11 · Wed May 02 2018 20:10:10 GMT+0800 (China Standard Time)

Oh cool, that's useful :)

Daniel · Answer 12 · Wed May 02 2018 20:35:40 GMT+0800 (China Standard Time)

Just saw that haven behaves differently in version 1.1.1, because for me, your code does not work as intended.

Carsten Schwemmer · Answer 13 · Wed May 02 2018 21:03:57 GMT+0800 (China Standard Time)

Hmm, maybe it would be worth waiting until Hadley responds to some of the issues related to 1.1.1? It seems as if some bugs got introduced in this version.

Daniel · Answer 14 · Wed Jun 20 2018 20:13:35 GMT+0800 (China Standard Time)

seems like Hadley is working on haven right now, you may want to check if this issue still exists (using the current dev-versions of sjlabelled and haven).

Daniel · Answer 15 · Wed Jun 27 2018 02:49:31 GMT+0800 (China Standard Time)

Any news on this issue?

Carsten Schwemmer · Answer 16 · Wed Jun 27 2018 03:06:12 GMT+0800 (China Standard Time)

sorry I'm currently at a workshop and can't really dig into that. I'll try to check in July.

Carsten Schwemmer · Answer 17 · Wed Jul 18 2018 22:02:58 GMT+0800 (China Standard Time)

Problem solved!

> data(efc)
> 
> efc <- efc %>%
+   select(c160age, e17age) %>%
+   rec(rec = "15:30=-99 [young]; 31:55=-98 [middle]; 56:max=3 [old]")
> 
> frq(efc$c160age_r)

# carer' age (x) <numeric> 
# total N=908  valid N=901  mean=-51.98  sd=50.38
 
 val  label frq raw.prc valid.prc cum.prc
 -99  young  48    5.29      5.33    5.33
 -98 middle 442   48.68     49.06   54.38
   3    old 411   45.26     45.62  100.00
  NA     NA   7    0.77        NA      NA


> tmp <- tempdir()
> write_stata(efc, paste0(tmp,"/df.dta"))
Tidying value labels. Please wait...
Writing stata file to 'C:\Users\kasus\AppData\Local\Temp\RtmpsvxSxb/df.dta'. Please wait...
> efc2 <- read_stata(paste0(tmp,"/df.dta"))
Converting labelled-classes. Please wait...
> 
> frq(efc2$c160age_r)

# carer' age (x) <numeric> 
# total N=908  valid N=901  mean=-51.98  sd=50.38
 
 val  label frq raw.prc valid.prc cum.prc
 -99  young  48    5.29      5.33    5.33
 -98 middle 442   48.68     49.06   54.38
   3    old 411   45.26     45.62  100.00
  NA     NA   7    0.77        NA      NA