herumi / fmath

Hi @herumi, 🙂

Could you explain in more detail how the expd function (in fmath.hpp) works? I have tried to follow the explanation in https://github.com/herumi/fmath/blob/master/algo-ja.md, but it did not clear up my doubts. 😅

Sorry for these questions, but I am not very familiar with unions and bitwise operations, so this expd function is hard for me to unpack. From what I have been able to understand, you first store the values of 2^(i/s) for i = 0, ..., s-1 (powers of two with exponents in [0, 1)) in a lookup table (ExpdVar c.tbl).

fmath/fmath.hpp

Lines 177 to 182 in 0a10069

    
    for (int i = 0; i < s; i++) {
        di di;
        di.d = ::pow(2.0, i * (1.0 / s));
        tbl[i] = di.i & mask64(52);
    }
}

Then, this lookup table is used in the expd function. Let me know if this is correct.
However, I was not able to follow the rest of the operations. More specifically:

1. What is the purpose of the variable b = 3ULL << 51?
2. Why do you calculate di.d = x * c.a + b?
3. What does the variable iax represent?
4. Shouldn't the value of t always be zero? I suppose this has something to do with floating-point numbers, since the equation, with real numbers, should simplify to zero.
5. What does the variable u represent? This computation is quite hard to understand (for a newbie like me) 😭
6. Finally, I suppose the value of y is the evaluation of a polynomial, but I do not know exactly what it represents.
7. I also have no idea about the final two operations (the bitwise OR and the product of y with di.d).

fmath/fmath.hpp

Lines 474 to 484 in 0a10069

    
    const uint64_t b = 3ULL << 51;
    di di;
    di.d = x * c.a + b;
    uint64_t iax = c.tbl[di.i & mask(c.sbit)];
    double t = (di.d - b) * c.ra - x;
    uint64_t u = ((di.i + c.adj) >> c.sbit) << 52;
    double y = (c.C3[0] - t) * (t * t) * c.C2[0] - t + c.C1[0];
    di.i = u | iax;
    return y * di.d;

I would very much appreciate it if you could describe the overall picture of how expd is computed and, if possible 🙏, give a more detailed breakdown of each line in expd.

Thank you very much for your time.

For any x = s + t, exp(x) = exp(s)exp(t).
Suppose that exp(s) can be computed by a table lookup and that exp(t), for a small t, is computed by a Maclaurin series.
I want to compute exp(t) as 1 + t + t^2/2 + t^3/6.
The resolution of a double is about 1e-16, so I want |t| to be smaller than 1/2^12 = 1/4096; then the first omitted term, t^4/24, is at most about 1.5e-16.
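
To make the accuracy claim concrete, here is a minimal sketch of the cubic Maclaurin polynomial. The name `exp_cubic` is only illustrative and is not part of fmath.

```cpp
#include <cmath>

// Cubic Maclaurin approximation of exp(t); only accurate for very small |t|.
double exp_cubic(double t) {
    // Horner form of 1 + t + t^2/2 + t^3/6
    return 1.0 + t * (1.0 + t * (0.5 + t * (1.0 / 6.0)));
}
```

Comparing `exp_cubic(1.0 / 4096)` with `std::exp(1.0 / 4096)` should show agreement to roughly the last bit or two of a double, while a larger argument such as 0.5 is already off in the third decimal place.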

Let x' = x * a. Split x' = n + t where n=round(x') is an integer and t is a fraction (|t|<=1/2).
Then x = (n/a) + (t/a).
If a = 2048/log(2) ~ 2954.6... then |t/a| <= 1/5909, which is smaller than 1/4096.

exp(n/a) = exp(n/2048 * log(2)) = 2^(n/2048) = 2^q * 2^(r/2048) where q = floor(n/2048), r = n mod 2048 (0 <= r < 2048).

2^q can be computed by a bit shift, so we apply a table lookup to 2^(r/2048).
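
The decomposition above can be sketched in plain C++ without any bit tricks. This is only an illustration under some assumptions: `exp_by_split` is a hypothetical name, `std::exp2` stands in for the table lookup, `std::ldexp` stands in for the bit shift, and `std::llround` stands in for whichever rounding method is used.

```cpp
#include <cmath>

// Sketch of exp(x) via x = n/a + rem with a = 2048/log(2).
// std::exp2 replaces the 2048-entry table and std::ldexp replaces the bit shift.
double exp_by_split(double x) {
    const double a = 2048.0 / std::log(2.0);
    const double ra = std::log(2.0) / 2048.0;         // 1/a
    const long long n = std::llround(x * a);          // n = round(x')
    const long long r = ((n % 2048) + 2048) % 2048;   // 0 <= r < 2048, also for negative n
    const long long q = (n - r) / 2048;               // floor(n / 2048)
    const double rem = x - n * ra;                    // t/a, a very small remainder
    // exp(rem) by the cubic Maclaurin polynomial
    const double poly = 1.0 + rem * (1.0 + rem * (0.5 + rem * (1.0 / 6.0)));
    // exp(x) = 2^q * 2^(r/2048) * exp(rem)
    return std::ldexp(std::exp2(r / 2048.0) * poly, static_cast<int>(q));
}
```

For moderate inputs (say |x| up to 10) this should agree with `std::exp` to within a few units in the last place. It does not handle overflow, underflow, or NaN.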

How to compute round(x').

There are several ways:

1. FPU
2. cvtsd2si (SSE)
3. roundpd (SSE4.1/AVX)
4. vrndscaleps (AVX-512)

I selected 1 (the FPU) because I wrote this code a very long time ago.

The format of a double is sign (1 bit) + exponent (11 bits) + fraction (52 bits).
See https://en.wikipedia.org/wiki/Double-precision_floating-point_format .

If a large value is added to x, the fractional part of x is rounded off.

For example,

12.25 + 2^52 = 2^52 * (1 + 2.72..10^(-15))   (exact value)
             ~ 2^52 * (1 + 12/2^52)          (after rounding to a 52-bit fraction)

Add b = (2^52 + 2^51) = 3ULL << 51 instead of 2^52 so that this also works when x is negative: the sum stays in [2^52, 2^53) as long as |x| < 2^51.
This is the first magic number.
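
A small sketch of this rounding trick follows. The function names are hypothetical, and the code assumes the default round-to-nearest mode, |x| < 2^51, and a compiler that does not reassociate floating-point arithmetic (no -ffast-math). It uses `std::memcpy` rather than a union to read the bits.

```cpp
#include <cstdint>
#include <cstring>

// Round x to the nearest integer by adding and then subtracting b = 3 * 2^51.
double round_via_magic(double x) {
    const double b = static_cast<double>(3ULL << 51);
    volatile double shifted = x + b;   // volatile: force a real 64-bit double store
    return shifted - b;
}

// Low 11 bits of the bit pattern of x + b, which equal round(x) mod 2048.
std::uint64_t low11_after_shift(double x) {
    const double b = static_cast<double>(3ULL << 51);
    const double shifted = x + b;
    std::uint64_t bits;
    std::memcpy(&bits, &shifted, sizeof bits);
    return bits & 2047;
}
```

With these definitions `round_via_magic(12.25)` gives 12.0 and `round_via_magic(-12.25)` gives -12.0. The second function shows why the rounded integer can be read straight out of the low bits of the double, including for negative inputs, where the low bits hold the value modulo 2048.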

To be continued...

To extract the fraction bits of a double value, use a union.

union di {
	double d;
	uint64_t i;
};
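
As an illustration of what reading the bits gives you, here is a sketch that pulls out the fraction and exponent fields. It swaps the union for `std::memcpy`, which is the portable way to do the same thing in standard C++. The helper names are made up for this example.

```cpp
#include <cstdint>
#include <cstring>

// Copy the 64-bit pattern of a double into an integer.
std::uint64_t bits_of(double d) {
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}

// Low 52 bits: the fraction field.
std::uint64_t fraction_field(double d) {
    return bits_of(d) & ((1ULL << 52) - 1);
}

// Next 11 bits: the biased exponent field (bias 1023).
std::uint64_t exponent_field(double d) {
    return (bits_of(d) >> 52) & 0x7FF;
}
```

For example, 1.5 = 1 + 1/2 has only the top fraction bit set, so `fraction_field(1.5)` is `1ULL << 51`, and `exponent_field(1.0)` is the bias 1023.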

tbl[di.i & 2047] gives 2^(r/2048) by a table lookup.
2^(r/2048) (0 <= r < 2048) is in [1, 2), so its exponent bits are always the same and the table stores only the fraction part.

for (int i = 0; i < s; i++) {
    di di;
    di.d = ::pow(2.0, i * (1.0 / s));
    tbl[i] = di.i & mask64(52);     // here
}
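
To show how the stored fraction bits are used, here is a sketch that builds the same kind of table and then reassembles 2^q * 2^(r/2048) by OR-ing an exponent field on top of a table entry. The names are hypothetical, and the explicit `q + 1023` (the IEEE 754 exponent bias) is my assumption about the role c.adj plays in the original code.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

// Fraction fields of 2^(i/2048) for i = 0..2047, mirroring the loop above.
std::vector<std::uint64_t> make_pow2_table() {
    const int s = 2048;
    std::vector<std::uint64_t> tbl(s);
    for (int i = 0; i < s; i++) {
        const double d = std::pow(2.0, i * (1.0 / s));
        std::uint64_t bits;
        std::memcpy(&bits, &d, sizeof bits);
        tbl[i] = bits & ((1ULL << 52) - 1);   // keep only the 52 fraction bits
    }
    return tbl;
}

// Rebuild 2^q * 2^(r/2048): biased exponent in the top bits, table entry below.
// Assumes a normal result, i.e. -1022 <= q <= 1023, and 0 <= r < 2048.
double pow2_from_table(const std::vector<std::uint64_t>& tbl, int q, int r) {
    const std::uint64_t bits =
        (static_cast<std::uint64_t>(q + 1023) << 52) | tbl[r];
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}
```

With `auto tbl = make_pow2_table();`, the call `pow2_from_table(tbl, 3, 1024)` reconstructs 2^3 * 2^(1/2). No multiplication is needed, because scaling by 2^q only changes the exponent field.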

By the way, a table lookup is inconvenient for SIMD, so I think that https://github.com/herumi/simdgen/blob/main/algorithm.md is a better algorithm.

Hi @herumi !🤗

Thank you so much for your rapid and detailed answer. 👏👏👏
After a thorough review, I think I have understood all of the information you provided in the comments.
For the time being, all of my concerns have been addressed; if I have any further questions, it would be awesome to contact you again.

Thanks!😊


Explanation about expd