msberends / AMR

Functions to simplify and standardise antimicrobial resistance (AMR) data analysis and to work with microbial and antimicrobial properties by using evidence-based methods, as described in https://doi.org/10.18637/jss.v104.i03.

Home page: https://msberends.github.io/AMR/

g.test doesn't handle zero counts

jtufto opened this issue:

The g.test function in your package fails when some of the counts are zero. Example:

x <- matrix(c(2,5,0,3),2)
AMR::g.test(x)
#> Warning in AMR::g.test(x): `fisher.test()` is always more reliable for 2x2
#> tables and although much slower, often only takes seconds.
#> Warning in AMR::g.test(x): G-statistic approximation may be incorrect due to E <
#> 5
#> 
#>  G-test of independence
#> 
#> data:  x
#> X-squared = NaN, p-value = NA

Thanks for reaching out! I’ll look into it.

I think I found a solution, but I'm curious to hear from your perspective if I did it statistically right.

The calculation of the statistic was:

STATISTIC <- 2 * sum(x * log(x / E), na.rm = TRUE)
# (for stats::chisq.test() this is: sum((abs(x - E) - YATES)^2 / E))

where E is the matrix:

#>      [,1] [,2]
#> [1,]  1.4  0.6
#> [2,]  5.6  2.4
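
For reference, the expected counts shown above follow from the row and column totals of x under the independence hypothesis. A minimal base R sketch, assuming x is the 2x2 matrix from the original report (this is an illustration, not necessarily how g.test() computes E internally):

```r
# Observed counts from the report
x <- matrix(c(2, 5, 0, 3), nrow = 2)

# Expected counts under independence:
# (row total * column total) / grand total, for every cell
E <- outer(rowSums(x), colSums(x)) / sum(x)
E
```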

# to recreate:
E <- matrix(c(1.4, 5.6, 0.6, 2.4), nrow = 2)

# log(x / E) then contains -Inf for the zero count:
# matrix(c(0.3566749, -0.1133287, -Inf, 0.2231436), nrow = 2)

The zero count makes log(x / E) equal to -Inf in that cell, and 0 * -Inf is NaN, so the sum only returns a statistic once NAs are removed:

# previously
2 * sum(x * log(x / E))
#> [1] NaN

# with na.rm = TRUE added
2 * sum(x * log(x / E), na.rm = TRUE)
#> [1] 1.632274

So can we determine the statistic this way, or should we handle the (negative) infinite value in log(x / E) differently?

Yes, I think that should work. The likelihood-ratio test (LRT) statistic involves the maximum likelihood under a saturated model where all the probabilities are estimated as $\hat p_{ij}=x_{ij}/n$. The likelihood contribution from cells with zero counts is therefore $\hat p_{ij}^{x_{ij}}=0^0=1$, and the log-likelihood contribution is $\log(0^0)=\log(1)=0$, so just removing the term when you get NaN should do the trick.
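
The argument above can be sketched in base R as follows. The helper name `g_statistic` is hypothetical and is not the package's actual implementation. It assumes a matrix of non-negative counts, computes E from the margins, and sets the contribution of zero-count cells to 0 explicitly instead of relying on na.rm = TRUE:

```r
# Hypothetical helper (not AMR's code): G-statistic for a contingency table,
# with zero-count cells contributing 0 to the sum.
g_statistic <- function(x) {
  x <- as.matrix(x)
  # Expected counts under independence
  E <- outer(rowSums(x), colSums(x)) / sum(x)
  # x * log(x / E) is NaN where x == 0; by the 0^0 = 1 argument that term is 0
  terms <- ifelse(x == 0, 0, x * log(x / E))
  statistic <- 2 * sum(terms)
  # Degrees of freedom for a test of independence
  df <- (nrow(x) - 1) * (ncol(x) - 1)
  p_value <- stats::pchisq(statistic, df = df, lower.tail = FALSE)
  list(statistic = statistic, df = df, p.value = p_value)
}

res <- g_statistic(matrix(c(2, 5, 0, 3), nrow = 2))
res$statistic
#> [1] 1.632274
```

Replacing only the zero-count terms, rather than dropping every NA or NaN, keeps genuinely missing input from being silently ignored. Whether that distinction matters depends on how the function validates its input beforehand.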

Yes, exactly. Many thanks for coming back! Fixed in 89a5778.