strengejacke / sjlabelled

Working with Labelled Data in R

Home Page: https://strengejacke.github.io/sjlabelled

write_stata drops negative value labels

cschwem2er opened this issue · comments

Hi,

Writing a data frame to a Stata file with write_stata() replaces negative values that carry value labels:

library(sjmisc)
library(tidyverse)
library(sjlabelled)
data(efc)

efc <- efc %>%
  select(c160age, e17age) %>%
  rec(rec = "15:30=-99 [young]; 31:55=-98 [middle]; 56:max=3 [old]")

frq(efc$c160age_r)

# carer' age (x) <numeric> 
# total N=908  valid N=901  mean=-51.98  sd=50.38
 
 val  label frq raw.prc valid.prc cum.prc
 -99  young  48    5.29      5.33    5.33
 -98 middle 442   48.68     49.06   54.38
   3    old 411   45.26     45.62  100.00
  NA     NA   7    0.77        NA      NA


tmp <- tempdir()
write_stata(efc, paste0(tmp,"/df.dta"))
efc2 <- read_stata(paste0(tmp,"/df.dta"))

frq(efc2$c160age_r)

# carer' age (x) <numeric> 
# total N=908  valid N=901  mean=2.40  sd=0.59
 
 val  label frq raw.prc valid.prc cum.prc
   1  young  48    5.29      5.33    5.33
   2 middle 442   48.68     49.06   54.38
   3    old 411   45.26     45.62  100.00
  NA     NA   7    0.77        NA      NA

I'm not sure whether this is related to tidyverse/haven#367, because for that issue, the Stata file seems to be correct. I checked the file produced with write_stata and it already contains the incorrect labels.

sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2       sjstats_0.14.2-4   sjlabelled_1.0.10  sjmisc_2.7.1.9000 
 [5] sjPlot_2.4.1.9000  forcats_0.3.0      stringr_1.3.0      dplyr_0.7.4       
 [9] purrr_0.2.4        readr_1.1.1        tidyr_0.8.0        tibble_1.4.2      
[13] ggplot2_2.2.1.9000 tidyverse_1.2.1   

loaded via a namespace (and not attached):
  [1] TH.data_1.0-8       minqa_1.2.4         colorspace_1.3-2    modeltools_0.2-21  
  [5] ggridges_0.5.0      rsconnect_0.8.5     rprojroot_1.3-2     htmlTable_1.11.2   
  [9] estimability_1.3    snakecase_0.9.1     base64enc_0.1-3     rstudioapi_0.7     
 [13] glmmTMB_0.2.0       mvtnorm_1.0-7       lubridate_1.7.2     coin_1.2-2         
 [17] xml2_1.2.0          codetools_0.2-15    splines_3.4.4       mnormt_1.5-5       
 [21] knitr_1.20.2        effects_4.0-1       bayesplot_1.5.0     Formula_1.2-2      
 [25] jsonlite_1.5        nloptr_1.0.4        ggeffects_0.3.2     broom_0.4.4        
 [29] cluster_2.0.6       compiler_3.4.4      httr_1.3.1          emmeans_1.1.3      
 [33] backports_1.1.2     assertthat_0.2.0    Matrix_1.2-12       lazyeval_0.2.1     
 [37] survey_3.33-2       cli_1.0.0           acepack_1.4.1       htmltools_0.3.6    
 [41] tools_3.4.4         coda_0.19-1         gtable_0.2.0        glue_1.2.0         
 [45] reshape2_1.4.3      Rcpp_0.12.16        carData_3.0-1       cellranger_1.1.0   
 [49] nlme_3.1-131        psych_1.8.3.3       lmtest_0.9-36       lme4_1.1-17        
 [53] rvest_0.3.2         devtools_1.13.5     stringdist_0.9.4.7  MASS_7.3-48        
 [57] zoo_1.8-1           scales_0.5.0.9000   hms_0.4.2           parallel_3.4.4     
 [61] sandwich_2.4-0      pwr_1.2-2           TMB_1.7.12          RColorBrewer_1.1-2 
 [65] yaml_2.1.16         curl_3.1            gridExtra_2.3       memoise_1.1.0      
 [69] rpart_4.1-12        latticeExtra_0.6-28 stringi_1.1.7       highr_0.6          
 [73] checkmate_1.8.5     rlang_0.2.0.9001    pkgconfig_2.0.1     arm_1.10-1         
 [77] evaluate_0.10.1     lattice_0.20-35     prediction_0.3.2    bindr_0.1          
 [81] htmlwidgets_1.0     tidyselect_0.2.4    plyr_1.8.4          magrittr_1.5       
 [85] R6_2.2.2            Hmisc_4.1-1         multcomp_1.4-8      pillar_1.2.2       
 [89] haven_1.1.0         foreign_0.8-69      withr_2.1.2         survival_2.41-3    
 [93] abind_1.4-5         nnet_7.3-12         modelr_0.1.1        crayon_1.3.4       
 [97] rmarkdown_1.8       grid_3.4.4          readxl_1.0.0        data.table_1.10.4-3
[101] git2r_0.21.0        digest_0.6.15       xtable_1.8-2        stats4_3.4.4       
[105] munsell_0.4.3      

Could you check whether tidy_labels() or as_label(..., drop.na = TRUE) already drops the negative value labels?

no dropping for tidy_labels:

tidy_labels(efc$c160age_r)[1:50]
 [1]   3 -98   3   3 -98   3   3   3   3 -98   3 -98   3   3 -98   3   3   3 -98 -98   3   3
[23] -98   3 -98   3   3   3   3   3   3   3   3   3   3   3   3   3   3 -98 -98   3   3   3
[45]   3   3   3   3   3   3

as_label() shows only labels, not values, but it includes all of them (young, middle and old):

as_label(efc$c160age_r, drop.na = TRUE)[1:50]

 [1] old    middle old    old    middle old    old    old    old    middle old    middle
[13] old    old    middle old    old    old    middle middle old    old    middle old   
[25] middle old    old    old    old    old    old    old    old    old    old    old   
[37] old    old    old    middle middle old    old    old    old    old    old    old   
[49] old    old   
Levels: young middle old

Does haven::write_dta() have the same issue, i.e. does it also fail to write negative values and their labels?

I think the problem is that I need to convert the labelled (numeric) variables into factors. Factors, however, can't have negative values:

factor(c(-99, -98, 3), levels = c("young", "middle", "aged"))
#> [1] <NA> <NA> <NA>
#> Levels: young middle aged
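To make the limitation concrete, here is a minimal base-R sketch (the vector `x` and the label map `labs` are made-up stand-ins, not the efc data). It shows that a factor stores only integer codes 1..k, so a round trip through a factor turns -99 and -98 into 1 and 2, which matches the frq() output after write_stata() above:

```r
# Made-up example values and value labels (assumption, not the efc data).
x <- c(-99, -98, 3, 3, -98)
labs <- c(young = -99, middle = -98, old = 3)

# Convert to a factor, using the labelled values as levels
# and the label text as level names.
f <- factor(x, levels = unname(labs), labels = names(labs))

levels(f)      # "young" "middle" "old"
as.integer(f)  # 1 2 3 3 2 -- the original -99 / -98 are gone
```

The label text survives, but the underlying values are replaced by the position of each level.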

If I do not convert variables into factors before writing, value labels are lost (at least this was true for haven until version 1.0.0 or so). So, could you check whether you still face this issue when you write your data with haven::write_dta()? If it works, I could remove the conversion to factors. If not, negative values can't be written.

I definitely can't just use haven::write_dta() on an sj dataset, e.g. efc, without losing the value labels.

library(sjmisc)
library(tidyverse)
library(sjlabelled)
data(efc)

efc <- efc %>%
  select(c160age, e17age) %>%
  rec(rec = "15:30=-99 [young]; 31:55=-98 [middle]; 56:max=3 [old]")

frq(efc$c160age_r)

# carer' age (x) <numeric> 
# total N=908  valid N=901  mean=-51.98  sd=50.38
 
 val  label frq raw.prc valid.prc cum.prc
 -99  young  48    5.29      5.33    5.33
 -98 middle 442   48.68     49.06   54.38
   3    old 411   45.26     45.62  100.00
  NA     NA   7    0.77        NA      NA


tmp <- tempdir()
haven::write_dta(efc, paste0(tmp,"/df.dta"))
efc2 <- read_stata(paste0(tmp,"/df.dta"))
frq(efc2$c160age_r)

# carer' age (x) <numeric> 
# total N=908  valid N=901  mean=-51.98  sd=50.38
 
  val frq raw.prc valid.prc cum.prc
  -99  48    5.29      5.33    5.33
  -98 442   48.68     49.06   54.38
    3 411   45.26     45.62  100.00
 <NA>   7    0.77        NA      NA

haven::print_labels(efc2$c160age_r)

Error: x must be a labelled vector

This is kind of expected, right? Because the sj packages and haven do not use value labels in the same way. But I'm not sure whether this is what you wanted me to try.

I just had another idea: maybe there is some "hacky" way round this problem, e.g. something like

  1. converting numerics to strings
  2. replacing the negative sign with some unique string, e.g. (<>)
  3. converting to factor
  4. writing the file

Obviously, this solution would be far from perfect, but still better than negative values disappearing completely. Also, when reading in the data afterwards, it would be possible to

  1. detect unique strings for negative values
  2. replace them with negative signs
  3. convert back to numeric
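The two lists above can be sketched in base R roughly as follows. This is only an illustration of the idea, not code from sjlabelled: the helper names `encode_neg()` and `decode_neg()` and the token `"<>"` are hypothetical, and the sketch covers the values only, not the value labels.

```r
# Hypothetical placeholder that stands in for the minus sign.
neg_token <- "<>"

# Steps 1-3 of the first list: numeric -> string -> token -> factor.
encode_neg <- function(x, token = neg_token) {
  s <- as.character(x)
  s <- sub("^-", token, s)  # replace a leading minus sign only
  factor(s)
}

# Steps 1-3 of the second list: detect token -> minus sign -> numeric.
decode_neg <- function(f, token = neg_token) {
  s <- as.character(f)
  s <- sub(token, "-", s, fixed = TRUE)
  as.numeric(s)
}

x <- c(-99, -98, 3, NA)
f <- encode_neg(x)
decode_neg(f)  # -99 -98 3 NA
```

The token has to be a string that can never occur in a regular value, otherwise decoding would be ambiguous.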

But I'm not sure whether this is what you wanted me to try.

Yes, exactly. Could you try that again, but convert to labelled-class before?

# need to remove these two, because labels are not of same type as vector values
efc <- efc %>% select(-n4pstu, -nur_pst) %>% map(as_labelled) %>% as_tibble()
tmp <- tempdir()
haven::write_dta(efc, paste0(tmp,"/df.dta"))

I guess that this will drop variable labels... That's why I convert labelled vectors to factors with variable names as labels.
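A way to check this in isolation is a small round trip that uses haven only. The sketch below assumes a reasonably recent haven (one where labelled() accepts a `label` argument); the column name `age_r` and its values are made up for illustration and are not the efc data:

```r
library(haven)

# Made-up labelled vector with negative labelled values
# and a variable label (assumption, not the efc data).
age_r <- labelled(
  c(-99, -98, 3, NA),
  labels = c(young = -99, middle = -98, old = 3),
  label = "carer age (recoded)"
)
df <- tibble::tibble(age_r = age_r)

# Write to a temporary Stata file and read it back.
path <- tempfile(fileext = ".dta")
write_dta(df, path)
back <- read_dta(path)

attr(back$age_r, "labels")  # value labels after the round trip
attr(back$age_r, "label")   # variable label after the round trip
as.numeric(back$age_r)      # underlying values after the round trip
```

If the values and both kinds of labels come back unchanged, the conversion to factors before writing would no longer be needed for Stata files.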

looks alright to me:

library(sjmisc)
library(tidyverse)
library(sjlabelled)
data(efc)

efc <- efc %>%
  select(c160age, e17age) %>%
  rec(rec = "15:30=-99 [young]; 31:55=-98 [middle]; 56:max=3 [old]") %>% 
  add_columns(efc)

efc <- efc %>% select(-n4pstu, -nur_pst) %>% map(as_labelled) %>% as_tibble()
tmp <- tempdir()
haven::write_dta(efc, paste0(tmp,"/df.dta"))


efc2 <- read_stata(paste0(tmp,"/df.dta"))
frq(efc2$c160age_r)
# carer' age (x) <numeric> 
# total N=908  valid N=901  mean=-51.98  sd=50.38
 
 val  label frq raw.prc valid.prc cum.prc
 -99  young  48    5.29      5.33    5.33
 -98 middle 442   48.68     49.06   54.38
   3    old 411   45.26     45.62  100.00
  NA     NA   7    0.77        NA      NA


sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] sjlabelled_1.0.10  forcats_0.3.0      stringr_1.3.0      dplyr_0.7.4        purrr_0.2.4        readr_1.1.1       
 [7] tidyr_0.8.0        tibble_1.4.2       ggplot2_2.2.1.9000 tidyverse_1.2.1    sjmisc_2.7.1.9000 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16        cellranger_1.1.0    pillar_1.2.2        compiler_3.4.4      plyr_1.8.4          bindr_0.1          
 [7] tools_3.4.4         lubridate_1.7.2     jsonlite_1.5        gtable_0.2.0        nlme_3.1-131        lattice_0.20-35    
[13] pkgconfig_2.0.1     rlang_0.2.0.9001    psych_1.8.3.3       cli_1.0.0           rstudioapi_0.7      yaml_2.1.18        
[19] parallel_3.4.4      haven_1.1.0         bindrcpp_0.2        withr_2.1.2         xml2_1.2.0          httr_1.3.1         
[25] knitr_1.20.2        hms_0.4.2           grid_3.4.4          tidyselect_0.2.4    glue_1.2.0          snakecase_0.9.1    
[31] data.table_1.10.4-3 R6_2.2.2            readxl_1.0.0        foreign_0.8-69      prediction_0.3.2    modelr_0.1.1       
[37] reshape2_1.4.3      magrittr_1.5        scales_0.5.0.9000   rvest_0.3.2         stringdist_0.9.4.7  assertthat_0.2.0   
[43] mnormt_1.5-5        colorspace_1.3-2    stringi_1.1.7       lazyeval_0.2.1      munsell_0.4.3       broom_0.4.4        
[49] crayon_1.3.4       

ok, seems that haven has fixed the issues with variable names, at least for Stata. I must check if this also applies to SPSS.

Btw, you don't need select() and add_columns() in your example, as rec() automatically selects and appends:

efc <- efc %>%
  rec(c160age, e17age, rec = "15:30=-99 [young]; 31:55=-98 [middle]; 56:max=3 [old]")

:-)

Oh cool, that's useful :)

Just saw that haven behaves differently in version 1.1.1, because for me, your code does not work as intended.

Hmm, maybe it would be worth waiting until Hadley responds to some of the issues related to 1.1.1? It seems as if some bugs were introduced in this version.

Seems like Hadley is working on haven right now. You may want to check whether this issue still exists (using the current dev versions of sjlabelled and haven).

Any news on this issue?

Sorry, I'm currently at a workshop and can't really dig into that. I'll try to check in July.

bump

Problem solved!

> data(efc)
> 
> efc <- efc %>%
+   select(c160age, e17age) %>%
+   rec(rec = "15:30=-99 [young]; 31:55=-98 [middle]; 56:max=3 [old]")
> 
> frq(efc$c160age_r)

# carer' age (x) <numeric> 
# total N=908  valid N=901  mean=-51.98  sd=50.38
 
 val  label frq raw.prc valid.prc cum.prc
 -99  young  48    5.29      5.33    5.33
 -98 middle 442   48.68     49.06   54.38
   3    old 411   45.26     45.62  100.00
  NA     NA   7    0.77        NA      NA


> 
> 
> 
> tmp <- tempdir()
> write_stata(efc, paste0(tmp,"/df.dta"))
Tidying value labels. Please wait...
Writing stata file to 'C:\Users\kasus\AppData\Local\Temp\RtmpsvxSxb/df.dta'. Please wait...
> efc2 <- read_stata(paste0(tmp,"/df.dta"))
  |                                                                      |   0%Converting labelled-classes. Please wait...

  |======================================================================| 100%
> 
> frq(efc2$c160age_r)

# carer' age (x) <numeric> 
# total N=908  valid N=901  mean=-51.98  sd=50.38
 
 val  label frq raw.prc valid.prc cum.prc
 -99  young  48    5.29      5.33    5.33
 -98 middle 442   48.68     49.06   54.38
   3    old 411   45.26     45.62  100.00
  NA     NA   7    0.77        NA      NA