mratsim / laser

The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers


Optimised random sampling methods

mratsim opened this issue · comments

Random sampling methods for common distributions are required.

Normal distribution:

Currently Arraymancer uses Box-Muller: https://github.com/mratsim/Arraymancer/blob/d05ef61847601fe253837329e0c8c47be18e8d9d/src/tensor/init_cpu.nim#L209-L235

import math, random

proc randomNormal(mean = 0.0, std = 1.0): float =
  ## Random number in the normal distribution using the Box-Muller method.
  ## Box-Muller produces deviates in pairs: the cosine deviate is returned
  ## immediately and the sine deviate is cached in the {.global.} variables
  ## for the next call (note this makes the proc thread-unsafe).
  ## See https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform
  var valid {.global.} = false
  var x {.global.}, y {.global.}, rho {.global.}: float
  if not valid:
    x = rand(1.0)
    y = rand(1.0)
    rho = sqrt(-2.0 * ln(1.0 - y))
    valid = true
    return rho*cos(2.0*PI*x)*std + mean
  else:
    valid = false
    return rho*sin(2.0*PI*x)*std + mean

The polar method and Ziggurat algorithm are apparently faster.
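For reference, Marsaglia's polar method can be sketched as follows (illustrative Python rather than Nim, and the `random_normal_polar` name is hypothetical). Like Box-Muller it yields deviates in pairs, but it replaces the two trig calls with a rejection loop over the unit disk, which accepts a candidate point with probability π/4 ≈ 78.5%:

```python
import math
import random

def random_normal_polar(mean=0.0, std=1.0, rng=random):
    """One normal deviate via Marsaglia's polar method.

    Rejection-samples a point (u, v) uniformly in the unit disk,
    then maps it to a pair of independent standard normals without
    calling cos/sin. Only the first deviate of the pair is returned
    here to keep the sketch stateless.
    """
    while True:
        u = 2.0 * rng.random() - 1.0
        v = 2.0 * rng.random() - 1.0
        s = u * u + v * v
        if 0.0 < s < 1.0:                     # inside the disk, not at origin
            f = math.sqrt(-2.0 * math.log(s) / s)
            # u*f and v*f are both standard normal; return one of them
            return u * f * std + mean
```

A production version would cache `v * f` and return it on the next call, exactly as the `{.global.}` variables do in the Box-Muller proc above.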

Reading

https://fac.ksu.edu.sa/sites/default/files/introduction-to-probability-model-s.ross-math-cs.blog_.ir_.pdf

I can give some feedback here. About 8 years ago I timed C versions of Ziggurat against Box-Muller for normal deviates on Intel Nehalem: roughly 25 clock cycles per deviate. Marsaglia's original paper reports about 400 MHz / 15 M samples/s ≈ 27 cycles. I haven't timed it lately, but "around 20-30" cycles still seems likely. That figure is for float32-precision unit normals, and assumes the three 512 B tables stay resident in L1. In practice one often generates a big batch of numbers and then consumes it, so cache competition is probably not hard to address.
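For concreteness, the Marsaglia-Tsang 128-layer ziggurat discussed above can be sketched as a Python port of the published C code (this is only an illustration of the algorithm's structure, not the ~25-cycle C version being timed; the `kn`/`wn`/`fn` table names follow the paper):

```python
import math
import random

# Marsaglia-Tsang ziggurat for the standard normal, 128 layers.
# R is the x-coordinate of the base layer, V the common area of each
# layer; both constants are from the 2000 paper.
R = 3.442619855899
V = 9.91256303526217e-3
M = 2147483648.0                 # 2^31, scale for the integer comparison

kn = [0] * 128                   # integer thresholds for the fast accept
wn = [0.0] * 128                 # layer widths, pre-divided by 2^31
fn = [0.0] * 128                 # pdf values at the layer edges

dn = tn = R
q = V / math.exp(-0.5 * dn * dn)
kn[0] = int((dn / q) * M)
kn[1] = 0
wn[0] = q / M
wn[127] = dn / M
fn[0] = 1.0
fn[127] = math.exp(-0.5 * dn * dn)
for i in range(126, 0, -1):
    dn = math.sqrt(-2.0 * math.log(V / dn + math.exp(-0.5 * dn * dn)))
    kn[i + 1] = int((dn / tn) * M)
    tn = dn
    fn[i] = math.exp(-0.5 * dn * dn)
    wn[i] = dn / M

def rnor(rng=random):
    """One standard-normal deviate via the ziggurat method."""
    while True:
        hz = rng.getrandbits(32) - 2**31   # signed 32-bit integer
        iz = hz & 127                      # pick one of the 128 layers
        if abs(hz) < kn[iz]:               # inside the rectangle: ~98% of calls
            return hz * wn[iz]
        if iz == 0:                        # base layer: sample the tail beyond R
            while True:
                x = -math.log(rng.random()) / R
                y = -math.log(rng.random())
                if y + y > x * x:
                    return R + x if hz > 0 else -(R + x)
        x = hz * wn[iz]                    # wedge: rejection test against the pdf
        if fn[iz] + rng.random() * (fn[iz - 1] - fn[iz]) < math.exp(-0.5 * x * x):
            return x
```

The speed comes from the first branch: about 98% of calls cost one random integer, one table compare, and one multiply, which is where the "20-30 cycles" figure originates; the tables (128 entries each) are what must stay in L1 for that to hold.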