apple / swift-numerics

Advanced mathematical types and functions for Swift

Hypot method in Real<Double> implementation

Veence opened this issue

As far as I understand, the complex modulus is, in pathological cases, computed using the hypot method in the Real+Double file, which scales Double up to Float80 to perform the calculation, working around a bug in the macOS libm implementation of hypot.

Since Float80 is significantly slower than Double, wouldn't it be worth adopting the approach described in Numerical Recipes and writing this instead?

  public static func hypot2(_ x: Double, _ y: Double) -> Double {
    // Simplification: treat any non-finite input as yielding infinity.
    guard x.isFinite && y.isFinite else { return .infinity }
    // Avoid 0/0 producing NaN when both arguments are zero.
    if x == 0 && y == 0 { return 0 }
    // Divide by the larger magnitude so the squared ratio cannot overflow.
    let (m, n) = abs(x) >= abs(y) ? (abs(x), y / x) : (abs(y), x / y)
    return m * sqrt(1 + n * n)
  }

Division is significantly slower than Float80 multiplication, so not really. Also, this method is marginally less accurate, though that's not a big deal in practice. There is an approach along these lines that's competitive (using multiplication by a carefully chosen power of two), but it's still not meaningfully faster than just using Float80.
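
For illustration, here is a minimal sketch of what a "multiply by a carefully chosen power of two" variant could look like. This is an assumption-laden example, not the swift-numerics implementation and not necessarily the exact approach meant above; the name hypotScaled is made up for this sketch.

  // Hypothetical sketch: rescale both operands by an exact power of two derived
  // from the larger operand's exponent, so that sx*sx + sy*sy can neither
  // overflow nor underflow, and no division appears in the critical path.
  func hypotScaled(_ x: Double, _ y: Double) -> Double {
    let ax = abs(x), ay = abs(y)
    guard ax.isFinite, ay.isFinite else { return .infinity }
    let big = max(ax, ay)
    if big == 0 { return 0 }
    // Clamp so that 2^(-e) stays representable even when `big` is subnormal.
    let e = max(big.exponent, -1022)
    let scale   = Double(sign: .plus, exponent: -e, significand: 1)  // exact power of two
    let unscale = Double(sign: .plus, exponent:  e, significand: 1)  // exact power of two
    let sx = ax * scale, sy = ay * scale  // magnitudes now at most about 2
    return (sx * sx + sy * sy).squareRoot() * unscale
  }

Because the scale factors are exact powers of two, the rescaling itself introduces no rounding error, unlike the y / x division in the Numerical Recipes form above.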

To give some concrete latency numbers for Skylake to back this up:

The latency chain for using Float80 is FMUL -> FADD -> FSQRT, plus a couple of cycles on each end to convert Double <-> Float80. The latencies are 5c, 3c, and 14-21c, for a total of 22-29c plus conversion overhead.
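
For reference, that chain corresponds to a computation along these lines (a minimal sketch assuming an x86 target where Float80 is available; hypotViaFloat80 is a made-up name, not the actual swift-numerics source):

  func hypotViaFloat80(_ x: Double, _ y: Double) -> Double {
    // Widen to Float80; its exponent range is wide enough that wx*wx + wy*wy
    // cannot overflow or underflow for any finite Double inputs.
    let wx = Float80(x), wy = Float80(y)
    // The two multiplies are independent, so the dependent chain is FMUL -> FADD -> FSQRT.
    return Double((wx * wx + wy * wy).squareRoot())
  }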

The latency chain for using the NR algorithm is DIVSD -> MULSD -> ADDSD -> SQRTSD -> MULSD, plus a few cycles of overhead for branch mispredicts. The latencies are 14c, 4c, 4c, 16c, 4c, for a total of 42c plus branch mispredict overhead.

Looking at throughput instead actually tips the balance even further in favor of using Float80, since it has fewer operations and only one operation that's not single-cycle throughput.

Wow. Thanks for your answer! I didn't know division was that much slower. Also, tbh, I had no figures for Float80 arithmetic. Thanks for taking the time to answer!

FWIW, you can find most latencies for a wide range of uArches here: https://uops.info/table.html
But they don't have Float80 instruction latencies; for those, Agner Fog's tables (http://agner.org/optimize/instruction_tables.pdf) are the best resource I know of.

Thanks for these awesome resources! I should catch up on x86 microarchitecture. The last time I did that sort of cycle-grained optimisation, I was studying MC 68040/56001 code. Geez, I feel old :) Though I did lend a hand to Clint Whaley on ATLAS a long time ago.