strengejacke / sjlabelled

Working with Labelled Data in R

Home Page: https://strengejacke.github.io/sjlabelled

different behavior for character vs. numeric data, when using named vector for labels argument of set_labels()?

jmobrien opened this issue

Using 1.1.4.

The documentation for set_labels suggests that the format for a named vector being passed to the labels argument is:

c([desiredlabel1] = [datavalue1], ..., [desiredlabel{n}] = [datavalue{n}])

e.g. from Examples to set_labels():

# assign labels with named vector
dummy <- sample(1:4, 40, replace = TRUE)
dummy <- set_labels(dummy, labels = c("very low" = 1, "very high" = 4))

This implies that the actual data values to be labelled are the elements of the labels, and the labels to be applied are the names of those elements. However, in practice, this gets reversed when the thing being labeled is a character vector.
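The convention described above relies only on how named vectors work in base R, so it can be checked without any labelling package. A minimal sketch (the variable name `lab` is only for illustration):

```r
# A named vector of the kind passed to the `labels` argument in the example above.
lab <- c("very low" = 1, "very high" = 4)

# The names carry the label text ...
names(lab)
#> [1] "very low"  "very high"

# ... and the elements carry the data values those labels describe.
unname(lab)
#> [1] 1 4
```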

See below:

# numeric version
numvec <- set_labels(
  1:4,
  labels = c(a = 1, b = 2, c = 3, d = 4)
)

numvec
[1] 1 2 3 4
attr(,"labels")
   a b c d   <-- labels are from the names attributes of vector passed to "labels"
   1 2 3 4   <-- data values that are labelled come from elements vector passed to "labels"

get_labels(numvec)

# character version
charvec <- set_labels(
  c("one", "two", "three", "four"),
  labels = c(a = "one", b = "two", c = "three", d = "four")
)

charvec
[1] "one" "two" "three" "four"
attr(,"labels")
   one  two  three  four    <-- here, labels come from the *elements* of vector passed to "labels"
   "a" "b" "c" "d"              <-- meanwhile,  data values that are labelled come from the *names* of vector passed to "labels" 

When this is done with character vectors, get_labels() then returns the wrong values for "labels", giving the values in the data instead:

# This is fine:
get_labels(numvec)
[1] "a" "b" "c" "d"

# This is not:
get_labels(charvec)
[1] "one" "two" "three" "four"  <-- again, these are the values, not the labels
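For comparison, the result one would expect under the documented convention can be sketched in base R. This builds the "labels" attribute by hand and does not call any sjlabelled function; `charvec_expected` and `expected_labels` are names made up for this illustration:

```r
# Attach a "labels" attribute manually, following the documented convention:
# names = label text, elements = data values.
charvec_expected <- c("one", "two", "three", "four")
attr(charvec_expected, "labels") <- c(a = "one", b = "two", c = "three", d = "four")

# Under that convention, the label text is simply the names of the attribute.
expected_labels <- names(attr(charvec_expected, "labels"))
expected_labels
#> [1] "a" "b" "c" "d"
```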

Is this a mistake, or is there something about intended behavior I'm not understanding?

It's showing up as an issue for me b/c I have a situation where the labels are generally serving multiple roles for Stata compatibility and helping merge datasets, but in a few cases I also want to provide metadata about more complex classification to my userbase, e.g.:

set_labels(c("TT", "CC", "TC", "CT", "TX", "XT", "CX", "XC", "XX"),
           labels = c(Treatment = "TT", Control = "CC",
                      Mixed = "CT", Mixed = "TC", 
                      PartialTreatment = "TX", PartialTreatment = "XT",
                      PartialControl = "CX", PartialControl = "XC",
                      Missing = "XX")
)
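To make the intent of this many-to-one labelling concrete, here is a small base R sketch of the lookup I have in mind. `label_of()` is a hypothetical helper, not something from sjlabelled, and it assumes the names hold the label text and the elements hold the data values:

```r
# Hypothetical helper: return the label text for each observation in `x`.
label_of <- function(x, labels) {
  unname(names(labels)[match(x, labels)])
}

# Base R allows duplicate names in a named vector, so several data values
# can share one label.
labs <- c(Treatment = "TT", Control = "CC", Mixed = "CT", Mixed = "TC")

label_of(c("TC", "TT", "CT", "XX"), labs)
#> [1] "Mixed"     "Treatment" "Mixed"     NA
```

Values without a matching entry (here "XX") come back as NA, which makes unlabelled codes easy to spot.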

Just following up about this after a while. I'd like to use the tool more in my workflow, but as it stands I'm just having to set things up manually.

I still don't understand the seemingly inconsistent behavior when labeling numeric vs. character data types. Looking at the code, this behavior appears to be intentional, with the linked section doing the flip if the types match (i.e., string data is given string labels).

If there's a mismatch because the labels are given numeric elements with string names, it ignores those and throws a warning, like so:

  example <- sample(c("one", "two", "three", "four"), 40, replace = TRUE)
  example <- set_labels(example, labels = c("one" = 1, "two" = 2, "three" = 3, "four" = 4))

  example   # unlabelled               

However, a labels argument structured like the above would already be improper based on the guidance in the help docs for set_labels.
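This kind of mismatch can also be caught before calling set_labels(). A minimal base R sketch; `labels_match_type()` is a hypothetical helper and is not meant to describe the check the package performs internally:

```r
# Hypothetical helper: TRUE when the data and the label values are both
# character or both non-character.
labels_match_type <- function(x, labels) {
  is.character(x) == is.character(labels)
}

example_values <- c("one", "two", "three", "four")

# Character data with numeric label values: mismatch.
labels_match_type(example_values, c(one = 1, two = 2, three = 3, four = 4))
#> [1] FALSE

# Character data with character label values: types agree.
labels_match_type(example_values, c(a = "one", b = "two"))
#> [1] TRUE
```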

In fact, it even looks like a similar mistake would be fixed if it were made when labeling a numeric vector, i.e. here

Is there a purpose to this behavior I'm not understanding? Perhaps something to do with Haven, etc.?

I think there must have been a reason to do this, but I can't remember. I changed the behaviour, so it is now in line with the default haven behaviour:

charvec <- sjlabelled::set_labels(
  c("one", "two", "three", "four"), 
  labels = c(a = "one", b = "two", c = "three", d = "four")
)

charvec2 <-  haven::labelled(
  c("one", "two", "three", "four"), 
  labels = c(a = "one", b = "two", c = "three", d = "four")
)

charvec
#> [1] "one"   "two"   "three" "four" 
#> attr(,"labels")
#>       a       b       c       d 
#>   "one"   "two" "three"  "four"
charvec2
#> <labelled<character>[4]>
#> [1] one   two   three four 
#> 
#> Labels:
#>  value label
#>    one     a
#>    two     b
#>  three     c
#>   four     d

sjlabelled::get_labels(charvec, value = "p")
#> [1] "[four] d"  "[one] a"   "[three] c" "[two] b"
sjlabelled::get_labels(charvec2, value = "p")
#> [1] "[four] d"  "[one] a"   "[three] c" "[two] b"

Created on 2021-05-11 by the reprex package (v2.0.0)

Excellent, thanks for your help on this.

And thanks for all your work on this package overall!

Just a small follow-up on this: is there a plan to release a new version of the package to CRAN any time soon? Just asking so I can plan. Much of my work is done on a managed server, and the policy is generally to use only official CRAN packages.

I just submitted an update to CRAN, and this resolved issue should be in the official CRAN release.
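For anyone on a managed server who wants to check whether the installed release is newer than the one the issue was reported against (1.1.4), something like the following should work. This is only a sketch: `is_newer_than_reported()` is a hypothetical helper, and it assumes the fix ships in any release after 1.1.4, since no fixed version number is given in this thread.

```r
# Hypothetical helper: TRUE if `installed` is later than the version in which
# the behaviour was reported (1.1.4, per the original post).
is_newer_than_reported <- function(installed, reported = "1.1.4") {
  package_version(installed) > package_version(reported)
}

# Only query the installed version if the package is actually available.
if (requireNamespace("sjlabelled", quietly = TRUE)) {
  is_newer_than_reported(as.character(utils::packageVersion("sjlabelled")))
}
```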