elbersb / segregation

R package to calculate entropy-based segregation indices, with a focus on the Mutual Information Index (M) and Theil’s Information Index (H)

Home Page: https://elbersb.com/segregation

Repository from Github https://github.com/elbersb/segregation
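For context, basic usage of the package looks roughly like this (a minimal sketch; schools00 is, to my knowledge, an example dataset bundled with the package, with one row per school-by-race cell and counts in column "n"):

library(segregation)

# total segregation (M and H) between schools by race,
# weighted by the student counts in column "n"
mutual_total(schools00, "race", "school", weight = "n")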

Negative local segregation values in the decomposition into racial groups

kaisarea opened this issue

Hello,
I have the following 'dataset' called local_data (trying to create a reproducible example here):

# A tibble: 14 x 3
   SCHOOLID   group   count
   <chr>      <chr>   <dbl>
 1 100005_870 WHITE     669
 2 100005_870 BLACK      12
 3 100005_870 HISP       80
 4 100005_870 AIAN        0
 5 100005_870 ASIAN       2
 6 100005_870 PACIFIC    16
 7 100005_870 TR         25
 8 100005_871 WHITE     703
 9 100005_871 BLACK      12
10 100005_871 HISP       47
11 100005_871 AIAN        0
12 100005_871 ASIAN       2
13 100005_871 PACIFIC     0
14 100005_871 TR         27

Then I run:

mutual_local(local_data, "SCHOOLID", "group", weight = "count", wide = TRUE)
I get the following output:

     group         ls           p
1:   ASIAN  5.2951875 0.002507837
2:   BLACK  3.5034280 0.015047022
3:    HISP  1.8714444 0.079623824
4: PACIFIC  4.6020403 0.010031348
5:      TR  2.7309779 0.032601881
6:   WHITE -0.5422359 0.860188088

My question is: how does one interpret negative values from the mutual_local() function? In one case I even had all components come out negative (I can try to create a reproducible example for that too if needed). What is the interpretation of zero, positive, and negative values here?

Hi, thanks for the issue. Local segregation scores can't be negative, so you've found a bug. The problem is that your variable is named "group", which the package doesn't handle well (presumably it collides with a name the package uses internally). If you use "race", for instance, the problem goes away:

library(tibble)
library(segregation)
options(scipen=5)

local_data = tribble(~SCHOOLID, ~race, ~count,
"100005_870", "WHITE",     669,
"100005_870", "BLACK",      12,
"100005_870", "HISP",       80,
"100005_870", "AIAN",        0,
"100005_870", "ASIAN",       2,
"100005_870", "PACIFIC",    16,
"100005_870", "TR",         25,
"100005_871", "WHITE",     703,
"100005_871", "BLACK",      12,
"100005_871", "HISP",       47,
"100005_871", "AIAN",        0,
"100005_871", "ASIAN",       2,
"100005_871", "PACIFIC",     0,
"100005_871", "TR",         27)

(mutual_local(local_data, "SCHOOLID", "race", weight = "count", wide = TRUE))
#>       race            ls           p
#> 1:   ASIAN 0.00003321619 0.002507837
#> 2:   BLACK 0.00003321619 0.015047022
#> 3:    HISP 0.03206493691 0.079623824
#> 4: PACIFIC 0.68502974604 0.010031348
#> 5:      TR 0.00108653019 0.032601881
#> 6:   WHITE 0.00054228911 0.860188088

Created on 2021-10-24 by the reprex package (v2.0.1)

I'll try to fix that issue soon.
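(Why can't the scores be negative? As I understand the package's definition, in the decomposition by groups the local segregation score of a group $g$ is a Kullback-Leibler divergence,

$$L_g = \sum_u p_{u|g} \log \frac{p_{u|g}}{p_u},$$

where $p_{u|g}$ is how group $g$ is distributed across units and $p_u$ is the overall unit distribution. A KL divergence is always non-negative, and it equals zero exactly when the group is spread across units in the same proportions as the population as a whole, so a negative ls can only come from a bug.)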

It's working now, thank you!

Hello,
I have the same problem, but I can't resolve it with the name change. In my case, the problem arises when I use the "se" argument and the function applies the bias correction.
Here is the code:

library(tidyverse)
library(segregation)
base  <- tribble(~ID_s, ~PRI, ~SEC, ~SUP,
                 1,     4,     4,     6,
                 2,    27,    34,    36,
                 3,     9,    15,    15,
                 4,    21,    33,    38,
                 5,    15,    23,    19,
                 6,     6,     8,     6,
                 7,     7,    14,    18,
                 8,     6,     8,    12,
                 9,    23,    34,    45,
                 10,    9,    16,    19
                 )
base |> 
  pivot_longer(cols = PRI:SUP, names_to = "EDU", 
               values_to = "n") |> 
  mutual_local(group = "EDU", unit = "ID_s",
               weight = "n", se = TRUE,
               wide = TRUE) |>
  select(ID_s, p, ls)

And this is my output:

    ID_s          p           ls
 1:    1 0.02539623 -0.072997966
 2:    2 0.18315094 -0.002555072
 3:    3 0.07269811 -0.024815143
 4:    4 0.17362264 -0.010504312
 5:    5 0.10986792 -0.004451141
 6:    6 0.03701887 -0.019953732
 7:    7 0.07281132 -0.010572342
 8:    8 0.04958491 -0.036701383
 9:    9 0.19315094 -0.004720493
10:   10 0.08269811 -0.017325337

The problem disappears when I set se = FALSE.

Thank you!

Hi, yes, that can happen when your sample is small. Basically, it means that your ls scores are most likely exactly zero. I could set them to 0 manually whenever this occurs, but I think leaving the raw values is more transparent. This is just something that can happen with the combination of bootstrap and bias correction when the parameters are close to 0. Maybe it would be good to have an FAQ entry about this, though.
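If you want to apply that zero-flooring yourself, a minimal sketch (assuming base from the code above is in scope, and res holds the wide, bias-corrected output of the mutual_local() call):

library(tidyverse)
library(segregation)

res <- base |>
  pivot_longer(cols = PRI:SUP, names_to = "EDU", values_to = "n") |>
  mutual_local(group = "EDU", unit = "ID_s",
               weight = "n", se = TRUE, wide = TRUE)

# floor the slightly negative bias-corrected local scores at zero
res |> mutate(ls = pmax(ls, 0)) |> select(ID_s, p, ls)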

Perfect. I did this manually but was not sure if it was correct.
Thank you for your response and your work with this package!