JuliaMath / SpecialFunctions.jl

Special mathematical functions in Julia

Home Page: https://specialfunctions.juliamath.org/stable/

Reduce/remove inner allocations (e.g. `gamma_inc_taylor`, `gamma_inc_asym`)

BioTurboNick opened this issue

A performance issue when calling the incomplete gamma function many times is the internal allocation of a 30-element buffer on every call.

A solution I've implemented is to overwrite the functions to use 1) a `StaticArrays.MVector`, and 2) a per-thread buffer pool. However, I don't know if this is the best solution.

```julia
using StaticArrays

const nthreadlimit = Threads.nthreads()   # one buffer per thread

const gamma_inc_taylor_buffers = map(1:nthreadlimit) do i
    @MVector zeros(30)
end

function _gamma_inc_get_buffer()
    buffer = gamma_inc_taylor_buffers[Threads.threadid()]
    buffer .= 0   # reset before reuse
    return buffer
end

# in the functions:
wk = _gamma_inc_get_buffer()
```

It would be better to have a solution in the package itself, to avoid the recompilation that comes from overwriting its methods.

```
4.814791 seconds (14.04 M allocations: 3.806 GiB, 18.75% gc time)   # Current
4.496342 seconds (14.04 M allocations: 3.217 GiB, 16.10% gc time)   # `@MVector zeros(30)`
3.123370 seconds (869.47 k allocations: 78.309 MiB)                 # Buffer pool + MVector
```
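As an aside, here is a minimal sketch of how this kind of measurement can be reproduced (the exact workload behind the numbers above wasn't given, so the arguments and iteration count are placeholders): many `gamma_inc` evaluations in a loop, timed with `@time`, which surfaces the per-call buffer allocations.

```julia
using SpecialFunctions

# Hypothetical workload: repeatedly evaluate the incomplete gamma function with
# arguments that take the Taylor-series branch (and hence its 30-element buffer).
function bench(n)
    s = 0.0
    for _ in 1:n
        s += gamma_inc(11.0, 9.0)[1]
    end
    return s
end

bench(10)               # warm up / force compilation
@time bench(1_000_000)  # reported allocations include the internal work buffers
```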

Which function are you talking about specifically? `gamma` really shouldn't be allocating at all, and an approach using StaticArrays is probably not needed. It should probably use an `ntuple`, as I doubt this package would ever pull in a dependency on StaticArrays, since so many other packages depend on SpecialFunctions.
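For illustration only, here is a rough sketch (mine, not the package's code or an agreed plan) of what an `ntuple`-style rewrite could look like: the 20-term work buffer becomes a fixed-size tuple built by a small recursive helper, so nothing is heap-allocated. The helper names and the 20-term cap are assumptions that mirror the loop quoted further down.

```julia
# Sketch: build the Taylor terms t_1 = x/(a+1), t_i = t_{i-1} * x/(a+i)
# as a fixed-size tuple instead of writing them into a heap-allocated `wk`.
taylor_terms(a, x, ::Val{N}) where {N} = _grow((x / (a + 1.0),), a, x, Val(N))

_grow(terms::NTuple{N}, a, x, ::Val{N}) where {N} = terms   # N terms collected: done
function _grow(terms::NTuple{M}, a, x, ::Val{N}) where {M,N}
    t = terms[end] * x / (a + M + 1)                        # next term of the recurrence
    return _grow((terms..., t), a, x, Val(N))
end

terms = taylor_terms(10.0, 9.0, Val(20))
sm = sum(reverse(terms))   # small-to-large summation (and any cutoff test) can still be done here
```

In principle every size here is known at compile time, so the compiler can keep the whole tuple on the stack; whether that actually wins against the current buffer would need benchmarking.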

Sorry, it's in the title: `gamma_inc_taylor` and `gamma_inc_asym`.

My guess is that the purpose is to add up the components from smallest to largest to increase floating point accuracy.

Ahh I see, (`wk = zeros(30)`) is the issue. I think this is just an issue that the original implementation was a translation of the Fortran routines (with stack-allocated arrays), so this is just not a good way to implement this in Julia. The way you sum them up, though, shouldn't matter, as the terms should be strictly decreasing where these algorithms are employed.

Honestly, if I could figure out what math expression it was actually implementing I'd just rewrite it, but it seems like it's doing something a bit more than the DLMF link.

Yeah.

So, the wrinkle for floating point series like this is that while the terms should strictly decrease, at some point they start to become unreliable - increasing, going negative, blowing up, etc. So you need to iterate one at a time and test whether the terms still make sense. And the number of terms at which it breaks down will be variable.

This loop is building up the terms, checking whether the term has a sensible magnitude, and storing it.

```julia
flag = false          # (set before this excerpt in the actual function)
apn = a + 1.0
t = x / apn
wk[1] = t
loop = 2
for indx = 2:20
    apn += 1.0
    t *= x / apn
    if t <= 1.0e-3    # term is now negligible; stop storing
        loop = indx
        flag = true
        break
    end
    wk[indx] = t
end
if !flag
    loop = 20
end
```

And, to preserve floating point accuracy, you add them up in reverse order, small to large.

```julia
for j = loop-1:-1:1
    sm += wk[j]
end
```

The middle loop is just adding up the smaller terms directly; I guess the accuracy there wasn't deemed as important to the overall result.

> at some point they start to become unreliable - increasing, going negative, blowing up, etc

I think my point is that it is inefficient to do this at runtime, and not needed. It should be very predictable when this occurs, depending on whether you are looking at a convergent or divergent series. We will know beforehand whether the series suffers from cancellation, and we can set up the algorithm appropriately. Checking all these things at runtime is not very efficient, and it shouldn't be required to store all these values. If the series is decreasing (or positive) it will be just as accurate to sum k = 0...inf as to sum in reverse order. If the series is diverging, then it won't matter how we sum it up; we will need to use fancier methods like sequence transformations to sum it accurately, which is something different.
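As a quick self-contained check of that claim (my own toy example, not from the package): for a strictly decreasing positive series such as the Taylor series of exp(x) with 0 < x < 1, forward (large-to-small) and reverse (small-to-large) summation typically agree exactly or to within about one ulp.

```julia
# Terms of exp(x) = sum_{k >= 0} x^k / k!, which strictly decrease for 0 < x < 1.
function exp_terms(x, n)
    t = 1.0
    terms = Float64[]
    for k in 0:n-1
        push!(terms, t)
        t *= x / (k + 1)
    end
    return terms
end

terms = exp_terms(0.9, 30)
forward  = foldl(+, terms)            # large-to-small
backward = foldl(+, reverse(terms))   # small-to-large
forward - backward                    # typically 0.0, or on the order of eps(forward)
```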

I think the unfortunate thing is that it is unclear to me what these two functions are computing, because they both point to the same NIST link. I'm sure I'd have to read the paper, but there isn't an equation reference number either 🤷‍♂️

> it will be just as accurate to sum k = 0...inf

This is not true for floating point addition though. The precision of the smaller numbers gets swallowed by the larger numbers.

```julia
gamma_inc(11, 9)
# Correct answer:                                (0.2940116796594886, 0.7059883203405114)
# Just adding up all terms from large to small:  (0.39103174184831896, 0.6089682581516811)
```
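A standalone illustration of the absorption effect being described here (my own toy example, not from the package): once the running sum is large, small terms added one at a time can be rounded away completely, whereas summing the small terms first preserves their combined contribution.

```julia
small = fill(eps() / 4, 1000)   # 1000 terms, each individually too small to move 1.0

foldl(+, small; init = 1.0)     # added one at a time onto 1.0: every addition rounds back to 1.0
1.0 + sum(small)                # small terms summed first: their total (~5.6e-14) survives
```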

I must admit the code itself is a bit of a Chesterton's Fence situation for me, but having dealt with a lot of series like this, I've found it's no more accurate to sum in this way. Of course, from a floating point standpoint, strict equality will not be met if you sum them differently, but usually the results agree to within about 1 ulp. If they don't, then the series representation should probably not be used in that domain.

I have no idea what you are doing in your example but just re-implementing their commented example...

```julia
using SpecialFunctions  # for the (internal) rgammax

function g(a, z::T) where T
    MaxIter = 5000
    t = one(T)
    s = zero(T)
    for i in 1:MaxIter
        s += t
        abs(t) < eps(T) * abs(s) && break
        t *= z / (a + i)
    end
    p = (SpecialFunctions.rgammax(a, z) / a) * s
    return (p, 1.0 - p)
end
```

```julia
julia> g(11.0, 9.0)
(0.29401167965948855, 0.7059883203405115)

julia> gamma_inc_taylor(11.0, 9.0, 0)
(0.2940116796594886, 0.7059883203405114)
```

Now, I am not sure why the original idea (the fence) to do it this way came about, so I'm not saying this is the right way to do it. But I think my original point stands: for these types of series, summing in reverse order is not usually needed.

Alright, yeah, your version works and passes all the tests. Fair enough!