apple / swift-numerics

Advanced mathematical types and functions for Swift

Normally-distributed random numbers

NevinBR opened this issue

It is common to need a random sample from either a real or complex normal distribution.

The standard library provides static random(in:) methods which sample uniformly, in a constrained extension of BinaryFloatingPoint. (The constraint is RawSignificand: FixedWidthInteger.)

I propose that we add similar functionality for sampling from a normal distribution on Real and Complex with the corresponding constraints.

Reasonable. I'm on vacation until next week, and probably won't get a chance to write a more detailed response until I return, but a couple quick things to consider:

  • I plan to provide a random module in SN, including both sampling from the usual distributions and some additional generators beyond what the stdlib has (especially one or more fast seeded generators; I have a branch somewhere with a quick draft of a PCG implementation, for example, and a rough sketch of the idea follows this list). Obviously this would be an easy first thing for such a module.

  • There are some interesting questions around tradeoffs in quality vs. speed for these transforms, and also around how to design an API that allows improving the transform in ways that may produce better results (i.e. users should be able to explicitly opt into stable-forever behavior, if they need to get the exact same results for all time, even if we have a better algorithm or fix a bug in the future). There's no really great precedent for what this should look like yet, AFAIK (though there are some fairly obvious options).

  • There are a few workloads that fundamentally depend on correctly-rounded samples of distributions--e.g. if the noise added in differential privacy is not a correctly-rounded sample of the appropriate distribution, the entire scheme collapses. So we may want to think about what an API that supports that sort of guarantee would look like (you wouldn't want it to be the default behavior, because it's prohibitively slow for many other uses).
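For concreteness, here is a minimal sketch of the kind of fast seeded generator mentioned above: a PCG32 (PCG XSH RR 64/32) conforming to the stdlib's RandomNumberGenerator protocol. The type name and initializer shape are illustrative, not the actual draft branch.

struct PCG32: RandomNumberGenerator {
  private var state: UInt64
  private let increment: UInt64

  init(seed: UInt64, sequence: UInt64 = 0) {
    // Standard PCG seeding: pick a stream, advance once, mix in the seed.
    increment = (sequence << 1) | 1
    state = 0
    _ = next32()
    state = state &+ seed
    _ = next32()
  }

  private mutating func next32() -> UInt32 {
    let old = state
    // 64-bit LCG step (multiplier from the PCG reference implementation).
    state = old &* 6364136223846793005 &+ increment
    // XSH RR output function: xorshift the high bits, then rotate.
    let xorshifted = UInt32(truncatingIfNeeded: ((old >> 18) ^ old) >> 27)
    let rot = UInt32(truncatingIfNeeded: old >> 59)
    return (xorshifted >> rot) | (xorshifted << ((32 &- rot) & 31))
  }

  mutating func next() -> UInt64 {
    UInt64(next32()) << 32 | UInt64(next32())
  }
}

A generator like this plugs straight into existing stdlib entry points, e.g. Double.random(in: 0..<1, using: &generator).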

Yes please :)

Here's TensorFlow's version, for reference.

I wonder—is sincos optimized enough to make it worthwhile to generate pairs of values from a normal distribution? And how about, um, sincos(piTimes:)?

extension Real {
  // Box-Muller transform: maps two independent uniform samples,
  // x in (0, 1] and y in [0, 1], to a pair of independent
  // standard-normal samples.
  static func boxMullerTransform(_ x: Self, _ y: Self) -> (Self, Self) {
    let r = sqrt(-2 * log(x))
    return (r * sin(piTimes: 2*y), r * cos(piTimes: 2*y))
  }
}

If it is well-optimized, then use cases that need many values can build a sequence two at a time, and a convenience method can cover the case where only a single number is needed.

extension Real where Self: BinaryFloatingPoint, RawSignificand: FixedWidthInteger {
  static func randomNormalPair() -> (Self, Self) {
    // random(in: 0...1) can return exactly 0, and log(0) is -infinity,
    // so redraw until the first coordinate is nonzero.
    var x = random(in: 0...1)
    while x == 0 { x = random(in: 0...1) }
    return boxMullerTransform(x, random(in: 0...1))
  }

  static func randomNormal(mean: Self = 0, standardDeviation sigma: Self = 1) -> Self {
    return randomNormalPair().0 * sigma + mean
  }
}

extension Complex where RealType: BinaryFloatingPoint, RealType.RawSignificand: FixedWidthInteger {
  // Uses one Box-Muller pair as the real and imaginary parts.
  static func randomNormal(mean: Self = 0, standardDeviation sigma: RealType = 1) -> Self {
    let (a, b) = RealType.randomNormalPair()
    return Self(a * sigma, b * sigma) + mean
  }
}
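For illustration, assuming the sketches above compile as written, usage might look like:

import Numerics

// Real-valued sample with a shifted, scaled distribution.
let x = Double.randomNormal(mean: 5, standardDeviation: 2)

// A standard complex normal sample (mean 0, sigma 1).
let z = Complex<Double>.randomNormal()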

I wonder—is sincos optimized enough to make it worthwhile to generate pairs of values from a normal distribution? And how about, um, sincos(piTimes:)?

No need; when the host system libm provides __sincospi (Darwin), LLVM will automatically fuse separate calls to __sinpi and __cospi (Evan Cheng added this optimization way back when I put those functions in the Darwin libm). When we're not using a host system libm function, the optimizer can see both functions and do outlining+CSE to eliminate the redundant operations.

My question was not whether we need to write sincos.

My question was whether we should design the API to include a function which generates pairs of normally-distributed values.

Oh, well, that doesn't require anything of sincos; you still get to benefit from doing half as many sqrt and log operations (sqrt is fast on some recent hardware, but log is about as costly as sin or cos), so that seems like a no-brainer.
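To make the pair-reuse concrete, here is a minimal sketch (the type name is invented for illustration) of a sampler that buffers the spare value from each Box-Muller pair, so single-value requests pay for the sqrt and log only once per two samples:

struct NormalSampler<T: Real & BinaryFloatingPoint> where T.RawSignificand: FixedWidthInteger {
  private var spare: T? = nil

  init() {}

  // Returns one standard-normal sample, generating a fresh
  // Box-Muller pair only on every other call.
  mutating func next() -> T {
    if let value = spare {
      spare = nil
      return value
    }
    let (a, b) = T.randomNormalPair()
    spare = b
    return a
  }
}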

There are a few workloads that fundamentally depend on correctly-rounded samples of distributions--e.g. if the noise added in differential privacy is not a correctly-rounded sample of the appropriate distribution, the entire scheme collapses. So we may want to think about what an API that supports that sort of guarantee would look like (you wouldn't want it to be the default behavior, because it's prohibitively slow for many other uses).

Note that the paper you cite introduces an attack that applies only to floating-point algorithms (more specifically, to floating-point implementations of the Laplacian distribution); as the paper says, "Fixed-point or integer-valued algorithms are immune to our attack". In fact, as I write in "Randomization with Real Numbers", "random non-integer numbers are rarely if ever seen in serious information security applications", and I wasn't much aware of the use of floating-point numbers in differential privacy until today.

I can't think of a security application where random integers (or at most random fixed-point numbers) could not have been used instead of random floating-point numbers.

Also, as for generating normally-distributed random numbers, the Box-Muller transform is only one possibility, and it is not the only one that generates pairs of numbers at a time (another is the polar method, sketched below). Other possibilities include the ratio-of-uniforms method, CDF inversion, Karney's algorithm, and the ziggurat algorithm, which don't necessarily generate pairs of random numbers.
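For comparison, here is a minimal sketch of Marsaglia's polar method (the method name polarNormalPair is invented for illustration). It also yields pairs, and trades the trig calls for a rejection loop that discards roughly 21% of candidate points:

extension Real where Self: BinaryFloatingPoint, RawSignificand: FixedWidthInteger {
  static func polarNormalPair() -> (Self, Self) {
    while true {
      // Rejection-sample a nonzero point strictly inside the unit disk.
      let u = random(in: -1...1)
      let v = random(in: -1...1)
      let s = u*u + v*v
      if s > 0 && s < 1 {
        let scale = sqrt(-2 * log(s) / s)
        return (u * scale, v * scale)
      }
    }
  }
}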