Notes: 32-bit implementation and perf optimization

Question

Notes: 32-bit implementation and perf optimization

mgold opened this issue 8 years ago · comments

Tests run on mgold/elm-nonempty-list test suite commit 9fe70d9 unless otherwise noted. Unreleased PCG versions were hand-copied into elm-stuff.

RNG	time to run nonemptylist suite (seconds)	reason for improvement
PCG 2.1.1 release	3.2+ (up to 4)	(baseline)
PCG bit32 branch	2.3 - 2.5	removal of 64-bit arithmetic and complicated `peel` implementation
PCG bit32 with hand-optimized JS	1.7	bitwise ops and `%` no longer buried in function calls
PCG stub branch	0.2	Complete removal of all RNG code; represents overhead of test suite
core Random, `elm-check`, `c8c83c2`	0.03	generator is statistically unsound, uses no bitwise ops. Theories below.
core Random, `elm-check`, `c8c83c2` with property-based tests removed	0.02	There is no way all those property tests ran in 10ms. Delta likely represents parsing their results.
core Random, project fuzzball, `85228d5`	1.8 - 2.2	core Random is not ridiculously fast when actually measured properly

Observations: With project-fuzzball, there's about 200ms of overhead and either RNG adds another 1.5 to 2 seconds. With elm-check, it takes 30ms.

I was really confused as to what Hassan was doing until I realized that the 30ms figure doesn't include the time to run the property-based tests! It's a completely invalid comparison!

The currently shipping PCG is about twice as slow as core/Random, but PCG-32 is only slightly slower, probably good enough to use. Plus, it will get even faster if bitwise op inlining is ever implemented by the compiler.

One possible flaw is that PCG-32 was measured against project-fuzzball/test version 1.x while core was against 5.x. Were there any major performance changes in the interceding versions?

Hand optimized JS:

$ diff index.html index.html.patched
10507,10522c10507,10509
<   var _p2 = _p1;
<   var _p3 = _p2._0;
<   var word = A2(
<       _elm_lang$core$Bitwise$xor,
<       A2(
<           _elm_lang$core$Bitwise$shiftRightLogical,
<           _p3,
<           A2(_elm_lang$core$Bitwise$shiftRightLogical, _p3, 28) + 4),
<       _p3) * 277803737;
<   return A2(
<       _elm_lang$core$Bitwise$shiftRightLogical,
<       A2(
<           _elm_lang$core$Bitwise$or,
<           A2(_elm_lang$core$Bitwise$shiftRightLogical, word, 22),
<           word),
<       0);

---
>     var _p2 = _p1._0;
>     var _word = ((_p2 >>> ((_p2 >>> 28) + 4)) ^ _p2) * 277803737;
>     return ((_word >>> 22) ^ _word) >>> 0;
10722,10724c10709
<   var _p58 = _p57;
<   return _mgold$elm_random_pcg$Random_Pcg$Seed(
<       A2(_elm_lang$core$Bitwise$shiftRightLogical, (_p58._0 * 1664525) + 1013904223, 0));

---
>   return _mgold$elm_random_pcg$Random_Pcg$Seed( ((_p57._0 * 1664525) + 1013904223) >>> 0 );
10747,10749c10732
<               if (_elm_lang$core$Native_Utils.eq(
<                   A2(_elm_lang$core$Bitwise$and, range, range - 1),
<                   0)) {

---
>               if (_elm_lang$core$Native_Utils.eq(range & (range - 1), 0)) {
10752,10758c10735
<                       _0: A2(
<                           _elm_lang$core$Bitwise$shiftRightLogical,
<                           A2(
<                               _elm_lang$core$Bitwise$and,
<                               _mgold$elm_random_pcg$Random_Pcg$peel(seed0),
<                               range - 1),
<                           0),

---
>                       _0: (_mgold$elm_random_pcg$Random_Pcg$peel(seed0) & (range - 1)) >>> 0,
10762,10768c10739
<                   var threshhold = A2(
<                       _elm_lang$core$Bitwise$shiftRightLogical,
<                       A2(
<                           _elm_lang$core$Basics$rem,
<                           A2(_elm_lang$core$Bitwise$shiftRightLogical, 0 - range, 0),
<                           range),
<                       0);

---
>                   var threshhold = ((-range >>> 0) % range) >>> 0;
10781c10752
<                                   _0: A2(_elm_lang$core$Basics$rem, x, range) + lo,

---
>                                   _0: (x % range) + lo,
13875c13846

Max Goldstein · Answer 1 · Sun Jul 31 2016 08:09:42 GMT+0800 (China Standard Time)

Raw benchmark numbers:

$ head -n 50 *.log
==> core.log <==
flip a coin x 826,532 ops/sec ±3.23% (52 runs sampled)
flip 1000 coins x 833 ops/sec ±2.69% (55 runs sampled)
generate an integer 0-4094 x 790,682 ops/sec ±1.63% (55 runs sampled)
generate an integer 0-4095 x 788,581 ops/sec ±1.49% (58 runs sampled)
generate an integer 0-4096 x 780,343 ops/sec ±1.57% (53 runs sampled)
generate a massive integer x 500,936 ops/sec ±1.30% (56 runs sampled)
generate a percentage x 359,647 ops/sec ±1.09% (58 runs sampled)
generate 1000 percentages x 466 ops/sec ±1.13% (54 runs sampled)
generate a float 0-4094 x 413,948 ops/sec ±1.15% (57 runs sampled)
generate a float 0-4095 x 415,087 ops/sec ±1.23% (56 runs sampled)
generate a float 0-4096 x 417,024 ops/sec ±1.38% (56 runs sampled)
generate a massive float x 358,135 ops/sec ±1.01% (58 runs sampled)

==> pcg32-js.log <==
flip a coin x 6,849,402 ops/sec ±1.17% (56 runs sampled)
flip 1000 coins x 6,699 ops/sec ±1.80% (37 runs sampled)
generate an integer 0-4094 x 1,199,519 ops/sec ±1.35% (56 runs sampled)
generate an integer 0-4095 x 1,580,108 ops/sec ±1.23% (55 runs sampled)
generate an integer 0-4096 x 1,265,778 ops/sec ±1.31% (56 runs sampled)
generate a massive integer x 1,495,435 ops/sec ±1.10% (55 runs sampled)
generate a percentage x 1,952,226 ops/sec ±1.05% (56 runs sampled)
generate 1000 percentages x 5,983 ops/sec ±0.94% (33 runs sampled)
generate a float 0-4094 x 3,224,260 ops/sec ±0.95% (56 runs sampled)
generate a float 0-4095 x 3,235,686 ops/sec ±1.14% (54 runs sampled)
generate a float 0-4096 x 3,023,070 ops/sec ±1.29% (57 runs sampled)
generate a massive float x 1,732,447 ops/sec ±0.93% (58 runs sampled)

==> pcg32.log <==
flip a coin x 3,130,551 ops/sec ±1.05% (55 runs sampled)
flip 1000 coins x 3,233 ops/sec ±1.12% (21 runs sampled)
generate an integer 0-4094 x 839,203 ops/sec ±1.03% (56 runs sampled)
generate an integer 0-4095 x 1,105,071 ops/sec ±1.68% (57 runs sampled)
generate an integer 0-4096 x 843,090 ops/sec ±0.78% (56 runs sampled)
generate a massive integer x 1,129,568 ops/sec ±1.44% (56 runs sampled)
generate a percentage x 1,363,954 ops/sec ±1.50% (56 runs sampled)
generate 1000 percentages x 2,056 ops/sec ±1.10% (57 runs sampled)
generate a float 0-4094 x 1,519,190 ops/sec ±1.11% (57 runs sampled)
generate a float 0-4095 x 1,452,386 ops/sec ±0.99% (57 runs sampled)
generate a float 0-4096 x 1,511,441 ops/sec ±1.24% (55 runs sampled)
generate a massive float x 1,285,961 ops/sec ±1.38% (57 runs sampled)

==> pcg64.log <==
flip a coin x 596,134 ops/sec ±1.19% (53 runs sampled)
flip 1000 coins x 457 ops/sec ±1.23% (54 runs sampled)
generate an integer 0-4094 x 272,902 ops/sec ±2.40% (54 runs sampled)
generate an integer 0-4095 x 332,912 ops/sec ±1.66% (55 runs sampled)
generate an integer 0-4096 x 289,518 ops/sec ±1.46% (55 runs sampled)
generate a massive integer x 339,060 ops/sec ±1.37% (55 runs sampled)
generate a percentage x 216,582 ops/sec ±1.36% (56 runs sampled)
generate 1000 percentages x 283 ops/sec ±1.27% (57 runs sampled)
generate a float 0-4094 x 255,547 ops/sec ±1.58% (54 runs sampled)
generate a float 0-4095 x 257,823 ops/sec ±1.42% (58 runs sampled)
generate a float 0-4096 x 239,220 ops/sec ±2.25% (58 runs sampled)
generate a massive float x 203,079 ops/sec ±1.45% (57 runs sampled)

Max Goldstein · Answer 2 · Sun Jul 31 2016 08:15:16 GMT+0800 (China Standard Time)

@evancz @rtfeldman

This table shows operations per second normalized to core's implementation as 1. >1 means faster than core by that factor. The columns represent the old version of PCG, the new algorithm, and the new algorithm with hand-optimized JS.

test name	PCG 64	PCG 32	PCG 32-JS
flip a coin	0.72	3.79	8.29
flip 1000 coins	0.55	3.88	8.04
generate an integer 0-4094	0.35	1.06	1.52
generate an integer 0-4095	0.42	1.40	2.00
generate an integer 0-4096	0.37	1.08	1.62
generate a massive integer	0.68	2.25	2.99
generate a percentage	0.60	3.79	5.43
generate 1000 percentages	0.61	4.41	12.84
generate a float 0-4094	0.62	3.67	7.79
generate a float 0-4095	0.62	3.50	7.80
generate a float 0-4096	0.57	3.62	7.25
generate a massive float	0.57	3.59	4.84

Takeaway: The new version of PCG will always be faster than core, even without the compiler inlining bitwise ops and rem.

Tiny caveat: I haven't implemented independent seeds (aka splitting) yet, which I expect to slow things down slightly. Unless we want to add that feature to core or want to take the (small) perf hit for being able to do so later easily, it won't be included in a PR to core.

But, that doesn't need to happen right away. If 0.18 is going to be about breaking things now instead of later, this can be delayed because it doesn't change the interface. At all. One of the side benefits of only 32 bits of state is that initialSeed2 is no longer necessary. It can literally be added in a patch release.

Richard Feldman · Answer 3 · Sun Jul 31 2016 09:18:55 GMT+0800 (China Standard Time)

This is really great! 🚀 Awesome work @mgold!

I'd say including independentSeed now makes sense, since we already have an important concrete use case for it: elm-test.

Max Goldstein · Answer 4 · Sun Jul 31 2016 14:13:39 GMT+0800 (China Standard Time)

Okay, I've implemented independentSeed but haven't completely tested it yet. Preliminarily, the perf hit is basically within the noise, as expected.

Max Goldstein · Answer 5 · Wed Aug 03 2016 12:07:18 GMT+0800 (China Standard Time)

Implemented in 3.0.0.