g.test doesn't handle zero counts

Question

jtufto opened this issue a year ago

The g.test function in your package fails when some of the counts are zero. Example:

x <- matrix(c(2,5,0,3),2)
AMR::g.test(x)
#> Warning in AMR::g.test(x): `fisher.test()` is always more reliable for 2x2
#> tables and although much slower, often only takes seconds.
#> Warning in AMR::g.test(x): G-statistic approximation may be incorrect due to E <
#> 5
#> 
#>  G-test of independence
#> 
#> data:  x
#> X-squared = NaN, p-value = NA
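
The NaN most likely comes from the zero cell. With the statistic computed as 2 * sum(x * log(x / E)), the form quoted later in this thread, a zero count contributes 0 * log(0). A minimal sketch of that arithmetic in base R, taking 0.6 as the expected count of the zero cell (the name `term` is only illustrative):

```r
# Same table as in the report; column-major fill puts the zero in cell [1, 2].
x <- matrix(c(2, 5, 0, 3), 2)

# Contribution of the zero cell: 0 * log(0 / 0.6) = 0 * -Inf, which is NaN.
term <- x[1, 2] * log(x[1, 2] / 0.6)
is.nan(term)

# A single NaN term propagates through sum() unless it is dropped explicitly;
# na.rm = TRUE removes NaN as well as NA.
sum(c(1, term))
sum(c(1, term), na.rm = TRUE)
```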

Matthijs Berends · Answer 1 · Thu Jan 26 2023 01:33:00 GMT+0800 (China Standard Time)

Thanks for reaching out! I’ll look into it.

Matthijs Berends · Answer 2 · Mon Jan 30 2023 15:54:50 GMT+0800 (China Standard Time)

I think I found a solution, but I'm curious to hear whether, from your perspective, it is statistically sound.

The calculation of the statistic was:

STATISTIC <- 2 * sum(x * log(x / E), na.rm = TRUE)
# (for stats::chisq.test() this is: sum((abs(x - E) - YATES)^2 / E))

where E is the matrix:

#>      [,1] [,2]
#> [1,]  1.4  0.6
#> [2,]  5.6  2.4

# to recreate:
E <- matrix(c(0.3566749, -0.1133287, -Inf, 0.2231436), nrow = 2)

Removing the NAs from the calculation does return a statistic, but the log of course does not work for the -Inf:

# previously
2 * sum(x * log(x / E))
#> [1] NaN
#> Warning message:
#> In log(x/E) : NaNs produced

# with na.rm = TRUE added
2 * sum(x * log(x / E), na.rm = TRUE)
#> [1] 22.48762
#> Warning message:
#> In log(x/E) : NaNs produced

So can we determine the statistic this way, or should we handle the (negative) infinite value in E differently?
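
One thing worth checking when reproducing these numbers: the values in the "to recreate" matrix coincide with log(x / E) for the E printed above it (for example log(2 / 1.4) ≈ 0.3566749), so the 22.48762 appears to come from taking the log a second time. The sketch below assumes instead that E holds the expected counts under independence, computed from the margins as in stats::chisq.test(); the names `terms`, `G`, `df` and `p` are illustrative and not taken from the package.

```r
x <- matrix(c(2, 5, 0, 3), 2)

# Expected counts under independence: (row total * column total) / n.
# This reproduces the 1.4 / 0.6 / 5.6 / 2.4 matrix shown above.
E <- outer(rowSums(x), colSums(x)) / sum(x)

# The non-zero cells of log(x / E) match the values in the "to recreate" matrix.
max(abs(log(x / E)[x > 0] - c(0.3566749, -0.1133287, 0.2231436))) < 1e-6

# Per-cell contributions; the zero cell gives 0 * -Inf = NaN (no warning here).
terms <- x * log(x / E)

# Dropping the NaN term treats the zero cell as contributing nothing.
G <- 2 * sum(terms, na.rm = TRUE)
G  # approximately 1.632

# Reference distribution: chi-squared with (rows - 1) * (cols - 1) degrees of freedom.
df <- (nrow(x) - 1) * (ncol(x) - 1)
p <- pchisq(G, df = df, lower.tail = FALSE)
```

With E defined this way, log(x / E) yields -Inf rather than a warning, and only the product with the zero count turns into NaN.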

Jarle Tufto · Answer 3 · Mon Jan 30 2023 18:31:07 GMT+0800 (China Standard Time)

Yes, that should work I think. The LRT statistic involves the maximum likelihood under a saturated model where all the probabilities are estimated as $\hat p_{ij}=x_{ij}/n$. The likelihood contribution from cells with zero counts is therefore $\hat p_{ij}^{x_{ij}}=0^0=1$ and the log-likelihood contribution is $\log(0^0)=\log(1)=0$ so just removing the term when you get NaN should do the trick.
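
Spelled out, the argument above can be sketched as follows, assuming multinomial sampling with total $n=\sum_{ij}x_{ij}$ and expected counts $E_{ij}=x_{i+}x_{+j}/n$ under independence (the symbols $\ell_{\text{sat}}$, $\ell_0$ and $G$ are introduced here only for illustration). The multinomial coefficient is the same in both log-likelihoods and cancels in the difference:

$$
\ell_{\text{sat}}=\sum_{ij}x_{ij}\log\frac{x_{ij}}{n},\qquad
\ell_0=\sum_{ij}x_{ij}\log\frac{E_{ij}}{n},
$$

$$
G=2\left(\ell_{\text{sat}}-\ell_0\right)=2\sum_{ij}x_{ij}\log\frac{x_{ij}}{E_{ij}},
\qquad\text{with the convention } 0\cdot\log 0=0 .
$$

A cell with $x_{ij}=0$ therefore adds nothing to either sum, which is what dropping the NaN term achieves numerically.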

Matthijs Berends · Answer 4 · Tue Jan 31 2023 00:36:43 GMT+0800 (China Standard Time)

Yes, exactly. Many thanks for coming back! Fixed in 89a5778.