Several issues with sjlabelled (missing value labels) and other sjverse packages

Question

cschwem2er opened this issue 6 years ago

Hi Daniel,

while using the sjverse packages to teach data analysis this term, we noticed some issues with the current versions, starting with missing value labels after reading in a Stata dataset. I have collected all issues in an R Markdown file, which is available here.

As for the issue specifically related to sjlabelled, the value labels seem to get lost somehow, although they definitely exist in the original Stata dataset:

library(tidyverse)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(sjstats)
setwd('C:/Dropbox/lehre/Methoden der politischen Soziologie/0 Daten/GLES Vorwahl')

d <- read_stata("GLES_Vorwahlquerschnitt_ZA5700_v1-0-0.dta")

frq(d$q11bb)

# Beabsichtigte Stimmabgabe: Zweitstimme (Version B) (x) <numeric>
# total N=2001  valid N=2001  mean=-18.09  sd=70.99

  val frq raw.prc valid.prc cum.prc
  -99 109    5.45      5.45    5.45
  -98 176    8.80      8.80   14.24
  -97 313   15.64     15.64   29.89
  -83  12    0.60      0.60   30.48
    1 556   27.79     27.79   58.27
    4 376   18.79     18.79   77.06
    5  79    3.95      3.95   81.01
    6 161    8.05      8.05   89.06
    7 144    7.20      7.20   96.25
  171   2    0.10      0.10   96.35
  180   4    0.20      0.20   96.55
  206   5    0.25      0.25   96.80
  209   3    0.15      0.15   96.95
  215  31    1.55      1.55   98.50
  225   1    0.05      0.05   98.55
  237   2    0.10      0.10   98.65
  322  27    1.35      1.35  100.00
 <NA>   0    0.00        NA      NA

The dataset for reproduction and a sessionInfo() output are available in the rmarkdown file linked above.
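A quick way to tell whether the value labels are already gone after import, or only dropped later by frq(), is to look at the "labels" attribute of the variable directly. The following is a minimal sketch in base R. The vector x is a hypothetical stand-in for d$q11bb, the label texts are invented for illustration, and has_value_labels() is a helper made up for this example, not part of any sjverse package.

```r
# Hypothetical stand-in for d$q11bb; values and label texts are invented.
x <- c(-99, 1, 4, 1)
attr(x, "label") <- "Example variable"
attr(x, "labels") <- c("no answer" = -99, "party A" = 1, "party B" = 4)

# Hypothetical helper: TRUE if a vector carries a value-label attribute.
has_value_labels <- function(v) {
  !is.null(attr(v, "labels", exact = TRUE))
}

# as.vector() drops all attributes of an atomic vector, which mimics
# an import step that loses the value labels.
stripped <- as.vector(x)

has_value_labels(x)         # labels attribute present
has_value_labels(stripped)  # labels attribute missing
```

If the same check on the imported variable shows no "labels" attribute, the labels were most likely lost while reading the file rather than in the frequency table.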

Daniel · Answer 1 · Tue Apr 24 2018 19:48:25 GMT+0800 (China Standard Time)

Problem 1) Might be this issue: tidyverse/haven#359
Problem 2) Can't reproduce. Since I have revised descr(), this issue may no longer exist in the dev version of sjmisc.
Problem 3) I think this is due to a wrong call to grpmean(). The function requires two variables: the numeric one (for the mean) and the categorical one (for the groups). You only specified the first variable. This example works for me:

d %>% 
    group_by(q1) %>% 
    grpmean(q62, q3)

Carsten Schwemmer · Answer 2 · Sat Apr 28 2018 16:48:37 GMT+0800 (China Standard Time)

Thanks for investigating.

Problem 1: this really seems to be a haven issue. We tried the earlier haven version 1.1.0 and did not experience any issues with missing value labels. I really hope the haven developers will fix this soon, as it also affects many sjverse users.
Problem 2: this does occur with the current CRAN version of sjmisc, but not with the dev version. Could you please consider pushing this to CRAN soon? As grouping and descriptives are very common procedures, I think this is an important bug fix.
Problem 3: I know that grpmean() is supposed to take two variables as input. My idea was that it would automatically detect when a grouped object is passed in and then use the grouping variable itself, so that only one additional variable needs to be specified. But this is really just a minor "nice to have" idea and not important.

Feel free to close this issue :)

Daniel · Answer 3 · Sat Apr 28 2018 17:05:30 GMT+0800 (China Standard Time)

The idea behind grouping and grpmean() is that you can compute "grouped means" for subgroups of a data frame. So grouping a data frame would not make sense if I used this group structure as the "grouping variable" for grpmean(), or am I confusing something here?

Daniel · Answer 4 · Sat Apr 28 2018 17:07:18 GMT+0800 (China Standard Time)

The next round of updates to CRAN is planned and will probably happen sometime next week.

Carsten Schwemmer · Answer 5 · Sat Apr 28 2018 17:13:12 GMT+0800 (China Standard Time)

> The idea behind grouping and grpmean() is that you can compute "grouped means" for subgroups of a data frame. So grouping a data frame would not make sense if I used this group structure as the "grouping variable" for grpmean(), or am I confusing something here?

No, sorry, you are right. My use case was really just the mean of a variable for each group in a grouped data frame, not the means of subgroups of an already grouped data frame. Combining both in one function would probably not be a good idea.
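For the simpler use case described here, one mean per group of a grouped data frame, plain dplyr may already be enough without grpmean(). The following is a minimal sketch under that assumption. The column names q1 and q62 mirror the ones used earlier in the thread, but the data frame toy and its values are invented for illustration.

```r
library(dplyr)

# Toy data; column names follow the thread, the values are invented.
toy <- data.frame(
  q1  = c("a", "a", "b", "b", "b"),
  q62 = c(1, 3, 2, 4, NA)
)

# One mean of q62 per level of q1; na.rm = TRUE skips missing values.
group_means <- toy %>%
  group_by(q1) %>%
  summarise(mean_q62 = mean(q62, na.rm = TRUE))

print(group_means)
```

This returns one row per level of q1, whereas the grpmean() call shown above additionally splits each group by a second categorical variable.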