Negative binomial cdf floating point precision not within range of f64 minimum positive.
noamteyssier opened this issue
Hey all, I noticed a small discrepancy in the floating-point precision of the negative binomial CDF when compared to SciPy. For very unlikely observations, statrs rounds down to zero even though the value should be well within f64 range. I suspect there's some rounding going on somewhere in the regularized incomplete beta function calculation below some fixed precision, but I haven't explored too deeply yet.
Here's a minimal reproducible example of the Rust code, along with the Python code for comparison of results.
```rust
use statrs::distribution::{DiscreteCDF, NegativeBinomial};

fn main() {
    let (x, r, p) = (2321., 269., 0.0405);
    let cdf = NegativeBinomial::new(r, p).unwrap().cdf(x as u64);
    println!("{:?}", cdf);
}
```
which results in `0.0`,
and the SciPy example, which under the hood pipes to Boost's `ibeta`:

```python
from scipy.stats import nbinom

x, r, p = 2321, 269, 0.0405
cdf = nbinom.cdf(x, r, p)
print(cdf)
```
which results in `2.803588112796817e-43`.
For visualization, here is what the parameterized distribution and the associated observation look like: you can see the observation is incredibly unlikely, so the true value is extremely close to zero. But since the f64 minimum positive normal value is `2.2250738585072014e-308`, it should still be well within the representable range.
The value `2.80358811[...]e-43` can be independently verified with Wolfram Alpha.
The problem is in `statrs/src/distribution/negative_binomial.rs`, line 114 (commit 0c8fb1a). `beta_reg` is the regularized lower incomplete beta function. For the given parameters, the 64-bit floating-point result of that `beta_reg` call is 1.0, so when it is subtracted from 1.0 the result is 0; the difference we wanted is lost.
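To see the cancellation concretely, here is a minimal sketch using the tail value SciPy computed above. The gap between 1.0 and the next f64 below it is 2^-53 (about 1.1e-16), so `1.0 - 2.8e-43` rounds to exactly `1.0`, and the subsequent subtraction returns exactly zero:

```rust
fn main() {
    // The quantity beta_reg would need to represent is 1 - tail, where
    // tail ≈ 2.8e-43 (the SciPy value above). The nearest f64 below 1.0 is
    // 1.0 - 2^-53 ≈ 1.0 - 1.1e-16, so 1.0 - tail rounds to exactly 1.0.
    let tail = 2.803588112796817e-43_f64;
    let inner = 1.0 - tail;
    assert_eq!(inner, 1.0); // the tail mass is already gone here
    let cdf = 1.0 - inner;
    println!("{cdf}"); // prints 0
}
```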
By using the symmetry of the function, I_x(a, b) = 1 - I_{1-x}(b, a) (see, for example, the "Incomplete beta function - Properties" section on Wikipedia), that expression can be changed to `beta::beta_reg(self.r, x as f64 + 1.0, self.p)`, which also nicely avoids the other precision-killing subtraction `1.0 - self.p`. If I make that change, all the existing tests pass, and the `cdf` method with `x = 2321, r = 269, p = 0.0405` returns `2.803588112801757e-43`, which agrees with the true value to a relative error of less than 1e-11.
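The effect of the two formulations can be reproduced without statrs. The sketch below is a self-contained regularized incomplete beta I_x(a, b) in the classic Numerical Recipes style (Lanczos log-gamma plus Lentz's continued fraction); it is not statrs's implementation, only an illustration that the old formulation collapses to 0 while the symmetric one preserves the tail:

```rust
/// ln Gamma(x) via the Lanczos approximation.
fn ln_gamma(x: f64) -> f64 {
    const COF: [f64; 6] = [
        76.18009172947146, -86.50532032941677, 24.01409824083091,
        -1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5,
    ];
    let mut y = x;
    let mut tmp = x + 5.5;
    tmp -= (x + 0.5) * tmp.ln();
    let mut ser = 1.000000000190015;
    for c in COF {
        y += 1.0;
        ser += c / y;
    }
    -tmp + (2.5066282746310005 * ser / x).ln()
}

/// Continued fraction for the incomplete beta function (Lentz's method).
fn betacf(a: f64, b: f64, x: f64) -> f64 {
    const MAXIT: usize = 500;
    const EPS: f64 = 1e-14;
    const FPMIN: f64 = 1e-300;
    let (qab, qap, qam) = (a + b, a + 1.0, a - 1.0);
    let mut c = 1.0;
    let mut d = 1.0 - qab * x / qap;
    if d.abs() < FPMIN { d = FPMIN; }
    d = 1.0 / d;
    let mut h = d;
    for m in 1..=MAXIT {
        let m = m as f64;
        let m2 = 2.0 * m;
        // even step of the continued fraction
        let aa = m * (b - m) * x / ((qam + m2) * (a + m2));
        d = 1.0 + aa * d;
        if d.abs() < FPMIN { d = FPMIN; }
        c = 1.0 + aa / c;
        if c.abs() < FPMIN { c = FPMIN; }
        d = 1.0 / d;
        h *= d * c;
        // odd step of the continued fraction
        let aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2));
        d = 1.0 + aa * d;
        if d.abs() < FPMIN { d = FPMIN; }
        c = 1.0 + aa / c;
        if c.abs() < FPMIN { c = FPMIN; }
        d = 1.0 / d;
        let del = d * c;
        h *= del;
        if (del - 1.0).abs() < EPS { break; }
    }
    h
}

/// Regularized lower incomplete beta function I_x(a, b).
fn beta_reg(a: f64, b: f64, x: f64) -> f64 {
    if x <= 0.0 || x >= 1.0 { return x.clamp(0.0, 1.0); }
    let bt = (ln_gamma(a + b) - ln_gamma(a) - ln_gamma(b)
        + a * x.ln() + b * (1.0 - x).ln()).exp();
    if x < (a + 1.0) / (a + b + 2.0) {
        bt * betacf(a, b, x) / a
    } else {
        1.0 - bt * betacf(b, a, 1.0 - x) / b
    }
}

fn main() {
    let (x, r, p) = (2321.0_f64, 269.0_f64, 0.0405_f64);
    // Old formulation: 1 - I_{1-p}(x + 1, r). The inner call rounds to
    // exactly 1.0, so the subtraction destroys the tail mass.
    let old = 1.0 - beta_reg(x + 1.0, r, 1.0 - p);
    // Symmetric formulation: I_p(r, x + 1). No subtraction from 1.0 occurs.
    let new = beta_reg(r, x + 1.0, p);
    println!("old = {old:e}, new = {new:e}"); // old is exactly 0, new ≈ 2.8e-43
}
```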
Not sure what the scope of this project is, but this sort of problem makes me think that having a dedicated survival function would be a good idea: a user computing `1 - cdf` post hoc runs into exactly the same precision-killing subtraction as the method above. It could potentially be included on the `ContinuousCDF` and `DiscreteCDF` traits? Would be happy to work on this if there is interest.
> ... having a dedicated survival function may be a good idea to avoid potential precision-killing subtraction
I agree, a dedicated survival function (where possible) is important to have. This is a lesson we learned in SciPy. Using `1 - cdf(x)` for the survival function (also known as the complementary CDF) kills all precision in the right tail. A dedicated inverse survival function should also be an option.
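As a purely hypothetical sketch of what that trait method could look like (illustrative names, not statrs's actual API): a lossy `1 - cdf(x)` default that distributions with a direct right-tail formula override:

```rust
// Hypothetical sketch only; names are illustrative, not statrs's actual API.
trait DiscreteCdfExt {
    fn cdf(&self, x: u64) -> f64;

    /// Survival function S(x) = P(X > x). The default 1 - cdf(x) loses all
    /// precision in the right tail; distributions with a direct formula
    /// (e.g. the symmetric beta_reg call above) should override it.
    fn sf(&self, x: u64) -> f64 {
        1.0 - self.cdf(x)
    }
}

// Toy implementation: a fair six-sided die (values 1..=6).
struct FairDie;

impl DiscreteCdfExt for FairDie {
    fn cdf(&self, x: u64) -> f64 {
        (x.min(6) as f64) / 6.0
    }
    // Override with a direct right-tail formula instead of 1 - cdf(x).
    fn sf(&self, x: u64) -> f64 {
        (6 - x.min(6)) as f64 / 6.0
    }
}

fn main() {
    let d = FairDie;
    println!("{} {}", d.cdf(2), d.sf(2)); // P(X <= 2) and P(X > 2)
}
```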