apple / swift-numerics

Advanced mathematical types and functions for Swift

Real part of complex expMinusOne is bad when e^{-x} ~ cos y

riastradh opened this issue · comments

For example, in binary32:

x = 1.09590280055999755859375
y = 1.230000019073486328125

The real part of e^{x + iy} - 1, computed in high precision with Sollya and rounded to binary32, is

-3.4550861727211668039672076702117919921875e-8,

but the expression fma(expm1(x), cos(y), -versin(y)) as used at

RealType._mulAdd(.cos(z.y), .expMinusOne(z.x), .cosMinusOne(z.y)),
gives (rounding each subexpression to binary32 in Sollya)

-6.9330411633927724324166774749755859375e-8,

in which every digit is wrong.

The problem is catastrophic cancellation when e^x cos y ~ 1. Factoring the expression to compute the addition with FMA still trips over catastrophic cancellation, because (e^x - 1) cos y ~ versin y when e^x cos y ~ 1. The same goes for the other obvious factorings, such as e^x - 1 - e^x versin y, or e^{x + log cos y} - 1 = e^{x + log(1 - versin y)} - 1, although those do worse when x >>> 1, because the low resolution of floating-point values around y ~ acos(e^{-x}) leaves no opportunity for cancellation of versin y against e^x cos y once y = fl(pi/2).
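
For concreteness, here is a minimal Swift sketch of the same evaluation using swift-numerics (the exact digits depend on how the host math library rounds expm1 and cos, but the cancellation is the same):

import Numerics

// The example point above, rounded to binary32.
let x: Float = 1.09590280055999755859375
let y: Float = 1.230000019073486328125

// Current factorization: expm1(x)*cos(y) + (cos(y) - 1).
// Each term is roughly ±0.666, but the true sum is about -3.5e-8, so nearly
// every bit cancels and the rounding error of the terms dominates the result.
let a = Float.expMinusOne(x) * Float.cos(y)
let b = Float.cos(y) - 1
let fused = b.addingProduct(Float.expMinusOne(x), Float.cos(y))   // fma form
print(a, b, a + b, fused)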

I don't see any clear way to compute the error in either term short of computing the transcendental functions in extended precision. I guess that might be easy for binary32, but in binary64 you'd presumably have to do the transcendental functions in double-Dekker arithmetic. Maybe there's a clever algebraic trick to avoid the subtraction altogether, but I haven't thought of one.
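
For context, "double-Dekker" here means carrying each intermediate value as an unevaluated sum of two machine numbers. The standard error-free building blocks look roughly like this (illustrative only; none of this is swift-numerics API):

// Each extended-precision value is an unevaluated sum hi + lo.
typealias DoubleWord = (hi: Double, lo: Double)

// Error-free sum of two doubles (Knuth's TwoSum).
func twoSum(_ a: Double, _ b: Double) -> DoubleWord {
    let s = a + b
    let bRounded = s - a
    let err = (a - (s - bRounded)) + (b - bRounded)
    return (s, err)
}

// Error-free product of two doubles via fused multiply-add.
func twoProd(_ a: Double, _ b: Double) -> DoubleWord {
    let p = a * b
    let err = (-p).addingProduct(a, b)   // fma(a, b, -p): exact error of a*b
    return (p, err)
}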


Nit: There appears to be a typo in the comment at:

// exp(x) cosMinuxOne(y) + expMinusOne(y)

It reads

exp(x) cosMinuxOne(y) + expMinusOne(y)

but it should read

exp(x) cosMinusOne(y) + expMinusOne(x)

Yup. This is a problem that no one has a good solution to yet. I have a few ideas, but haven't had time to sort through them.

As you suggest, the cosm1 factorization is strictly better than the expm1 factorization (it at least gets an absolute error bound for this case, which is why I'm using it). This is discussed superficially in the updated comments here: #152 (which also fixed the typo you caught).

short of computing the transcendental functions in extended precision

There's not even an easy proof that this suffices--how close can (x,y) be to exp(-x) = cos(y)? This is non-trivial (though there's good reason to believe you won't get much closer than random luck would provide, so using a format twice wider should suffice), and you need to know in order to bound the precision required. For any fixed precision, one could use tables to store base points (x,y) that are unusually close to the curve, and expand around them, but that doesn't work for an implementation that supports generic real types.

short of computing the transcendental functions in extended precision

There's not even an easy proof that this suffices--how close can (x,y) be to exp(-x) = cos(y)? This is non-trivial, and you need to know in order to bound the precision required.

I didn't think very hard about how doing it in extended precision would actually work; mostly I just wanted an excuse to say 'double-Dekker arithmetic'.

I have some thoughts on this; not sure if they'll help, though.

First, if cos(y) is negative or zero then we know the real part of the result will be at most -1, hence not near 0, so we can use a simple calculation.

Second, the problem curve approaches cos(y) = 0 as x increases, so cosMinusOne(y) is actively harmful in that region for large x.

Third, when x is small, the problem curve approaches (exp(x), cos(y)) = (1, 1), so both expMinusOne(x) and cosMinusOne(y) could plausibly be useful. One way to utilize both is to notice that a*b - 1 = (a-1)*(b-1) + (a-1) + (b-1). That means we can go:

let e = RealType.expMinusOne(x)
let c = RealType.cosMinusOne(y)
let u = c*e + c + e        // Or similar with augmented sum / fma / etc.

An entirely different approach would be to look at the distance between (x, y) and the problem curve. This distance could be measured horizontally, vertically, or perhaps even along the gradient of cos(y) - exp(-x). Horizontal is easiest:

When cos(y) is positive, the horizontal distance from the curve to the point is given by d = x + log(cos(y)). Of course, that involves catastrophic cancellation near the curve, so maybe it doesn’t buy us much.

Ignoring that issue, if cos(y) is close to 1 we might rewrite as d = x + log(onePlus: cosMinusOne(y)). In either case, the result we want is just u = expMinusOne(d). I’m not sure there’s an easy way around the cancellation issue though, and when cos(y) is close to 1 we’d probably rather use a different approach anyway.
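
A rough sketch of that horizontal-distance idea (the function name is made up, and the T.log(c) step is exactly where the cancellation lives):

import Numerics

// Illustrative only: when cos(y) > 0,
//   Re(exp(z) - 1) = exp(x)*cos(y) - 1 = expMinusOne(x + log(cos(y))).
func realPartViaDistance<T: Real>(x: T, y: T) -> T {
    let c = T.cos(y)
    precondition(c > 0, "only meaningful where cos(y) > 0")
    let d = x + T.log(c)          // or x + T.log(onePlus: c - 1) when c is near 1
    return T.expMinusOne(d)       // d itself suffers cancellation near the curve
}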

Second, the problem curve approaches cos(y) = 0 as x increases, so cosMinusOne(y) is actively harmful in that region for large x.

Well, no, because the alternative factorization (exp(x) cosMinusOne(y) + expMinusOne(x)) has terms that get arbitrarily large, while the terms in the factorization we're using stay bounded near 1. So it's not what we want, but it's far better than the only alternative we have available currently.

Note that this is pretty easy to make rigorous; we're interested in comparing the magnitude of f(x) = exp(x) - 1 and g(y) = cos(y) - 1 along the curve exp(x)cos(y) = 1. If we solve for y we get:

g(y(x)) = cos(acos(exp(-x))) - 1 = exp(-x) - 1

Which is always smaller in magnitude than f(x) when x > 0.

To your third point, the final answer is still very nearly zero, so that expression is just going to sum the rounding errors in cosm1 and expm1, no matter how carefully you sum the terms--the fundamental problem remains that you need an extra-precise cos and exp function (and we can't even bound the number of digits you might need--there is some representable point closest to the curve, which I can compute by exhaustive testing for Float, but not for larger types).

Note that even just computing the distance to the curve requires ... an extra precise cosine and exp, barring some very clever trick (because of the same cancellation issue). This is a fun research problem, but quite non-trivial.
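
To make the "exhaustive testing" remark concrete, here is a rough sketch of the kind of search one could run for Float (not the actual test referenced above; the sample range is arbitrary and the Double residual is only a proxy for the true distance):

import Numerics

// For each Float x in a sample range, take the Float y nearest to the curve
// exp(-x) == cos(y) and record how small exp(x)*cos(y) - 1 can get (measured
// in Double). The smaller it can get, the more working precision a correctly
// rounded Re(expMinusOne) needs near the curve.
var smallestResidual = Double.infinity
var x: Float = 0.5
while x < 2 {                                     // arbitrary sample range
    let yOnCurve = Double.acos(Double.exp(-Double(x)))
    let y = Float(yOnCurve)                       // nearest representable y
    let residual = abs(Double.exp(Double(x)) * Double.cos(Double(y)) - 1)
    if residual < smallestResidual { smallestResidual = residual }
    x = x.nextUp
}
print(smallestResidual)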

My point is that for large enough x, using expMinusOne doesn’t gain any precision, and if we are also close to the curve then using cosMinusOne actually loses precision. In that situation neither factorization is helpful, so the naive calculation u = exp(x) * cos(y) - 1 should be just as good and maybe better.

In other places, like when (x, y) is close to (0, 0), there is something to gain by using both expMinusOne(x) and cosMinusOne(y), so perhaps the a*b + a + b dual factorization might outperform either of the single factorizations.

Of course for z extremely close to 0 + 0i, the nearest representable value to expMinusOne(z) is simply z. I don’t know if it helps to special-case that, or just let it fall out naturally from the calculation.

Basically, what I’m suggesting is to partition the plane into some number of regions, and use whichever approach is most accurate in each.


I could be mistaken, but I believe the threshold for expMinusOne to gain precision is when 1/2 < exp(x) < 2. So if x > log(2) we’re fine going with exp.

Similarly, the threshold for cosMinusOne to gain precision should be cos(y) > 1/2.

Right, exp(x) * cos(y) - 1 is about as good on that region, but not appreciably better. The expMinusOne alternative is always strictly worse. So there's no reason to use either of those on any region of the plane. You're always at least as well off using what we have now, and often better off.

In an additive expression with cancellation, the error comes only from the rounding of the terms that are cancelling. If you look at what we use currently:

expMinusOne(x) cos(y) + cosMinusOne(y)

Assuming a good-quality host math library, expMinusOne, cos, and cosMinusOne are all computed with small relative error (ideally sub-ulp, a couple of ulp at worst). So we are evaluating an expression of the form a(1+δ₁) + b(1+δ₂) where a and b are the exact real values of (e^x-1)cos(y) and cos(y)-1, respectively and the deltas are bounded by a small multiple of epsilon. Because we're interested in the region close to the critical curve, the addition is computed exactly, so the result is precisely Re(exp(z)-1) + δ₁a + δ₂b, and so the absolute error is bounded by (|a|+|b|)kε for some small k.
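
Plugging the numbers from the top of the thread into that bound (a quick back-of-the-envelope check, not taken from the discussion):

// At the example point, a ≈ (e^x - 1)·cos(y) ≈ 0.666 and b = cos(y) - 1 ≈ -0.666,
// and ε = 2^-24 for Float.
let a = 0.666, b = -0.666, epsilon = 0x1p-24
let bound = (abs(a) + abs(b)) * epsilon   // ≈ 7.9e-8 with k = 1
print(bound)                              // the ~3.5e-8 error observed above fits inside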

Note that this analysis holds for all of the expressions you consider, except the "dual" factorization; the way we minimize the error bound is to minimize |a| and |b|. Among the alternatives considered, the one in use minimizes them for all input values, not just in some region. There's a small note to make for exp(x)cos(y) - 1, because 1 is exact and therefore does not contribute to the error, but that buys you at most a bit or two in practice in this case, hardly worth the added complexity when you still have essentially no good bits in the result (it would be likely to be worth using in an arbitrary-precision implementation, however).

The dual factorization is actually worse in a lot of ways, because the cancellation doesn't necessarily happen in a single shot, so you have to be careful about how you order the terms (and/or use a compensated sum in some form), and you're still running into the fundamental cancellation issue anyway.

Forgot to address z very nearly 0 😀

In this case, expMinusOne(x) cos(y) + cosMinusOne(y) becomes x*1 - y^2/2 ≈ x, so there's no loss of accuracy in just using the expression we have.

In an additive expression with cancellation, the error comes only from the rounding of the terms that are cancelling. If you look at what we use currently:

expMinusOne(x) cos(y) + cosMinusOne(y)

Assuming a good-quality host math library, expMinusOne, cos, and cosMinusOne are all computed with small relative error (ideally sub-ulp, a couple of ulp at worst). So we are evaluating an expression of the form a(1+δ₁) + b(1+δ₂) where a and b are the exact real values of (e^x-1)cos(y) and cos(y)-1, respectively and the deltas are bounded by a small multiple of epsilon. Because we're interested in the region close to the critical curve, the addition is computed exactly, so the result is precisely Re(exp(z)-1) + δ₁a + δ₂b, and so the absolute error is bounded by (|a|+|b|)kε for some small k.

I’ll trust your expertise here, but it still seems strange to me that we would intentionally calculate cosMinusOne when cos is close to zero.

It's no less accurate than the alternatives we have (as explained), not significantly slower, and yields considerable simplicity in the implementation. What's not to like?