espadrine / shishua

SHISHUA – The fastest PRNG in the world

romutrio is not faster than wyrand

wangyi-fudan opened this issue

I tested romutrio with https://github.com/lemire/testingRNG
It seems that wyrand is still the fastest without AVX:

We repeat the benchmark more than once. Make sure that you get comparable results.
Generating 65536 bytes of random numbers
Time reported in number of cycles per byte.
We store values to an array of size = 64 kB.

We just generate the random numbers:
xorshift_k4: 1.05 cycles per byte
xorshift_k5: 1.14 cycles per byte
mersennetwister: 1.78 cycles per byte
mitchellmoore: 1.99 cycles per byte
widynski: 1.14 cycles per byte
xorshift32: 1.34 cycles per byte
pcg32: 1.03 cycles per byte
rand: 3.39 cycles per byte
aesdragontamer: 0.44 cycles per byte
aesctr: 0.50 cycles per byte
lehmer64: 0.50 cycles per byte
xorshift128plus: 0.51 cycles per byte
xoroshiro128plus: 0.47 cycles per byte
splitmix64: 0.50 cycles per byte
pcg64: 0.70 cycles per byte
xorshift1024star: 0.93 cycles per byte
xorshift1024plus: 0.60 cycles per byte
romutrio: 0.57 cycles per byte
wyrand: 0.43 cycles per byte

Now, let's go back to the simplest benchmark code:

#include <sys/time.h>
#include <cstddef>
#include <cstdint>
#include <iostream>
using namespace std;

uint64_t seed=0;
inline uint64_t wyrand(void){
    seed+=0xa0761d6478bd642full;
    __uint128_t t=(__uint128_t)(seed^0xe7037ed1a0b428dbull)*seed;
    return (t>>64)^t;
}

#define ROTL(d,lrot) ((d<<(lrot)) | (d>>(8*sizeof(d)-(lrot))))
// Arbitrary nonzero seeds; an all-zero state would make romuTrio return only zeros.
uint64_t xState=1, yState=2, zState=3;
uint64_t romuTrio_random () {
    uint64_t xp = xState, yp = yState, zp = zState;
    xState = 15241094284759029579u * zp;
    yState = yp - xp; yState = ROTL(yState,12);
    zState = zp - yp; zState = ROTL(zState,44);
    return xp;
}

int main(void){
    timeval beg, end; uint64_t ret=0, rep=0x10000000;

    // Each line printed is rep / elapsed seconds, scaled to billions of numbers per second.
    gettimeofday(&beg,NULL);
    for(size_t r=0; r<rep; r++) ret+=wyrand();
    gettimeofday(&end,NULL);
    cerr<<"wyrand\t"<<1e-9*rep/(end.tv_sec-beg.tv_sec+1e-6*(end.tv_usec-beg.tv_usec))<<'\n';

    gettimeofday(&beg,NULL);
    for(size_t r=0; r<rep; r++) ret+=romuTrio_random();
    gettimeofday(&end,NULL);
    cerr<<"romutrio\t"<<1e-9*rep/(end.tv_sec-beg.tv_sec+1e-6*(end.tv_usec-beg.tv_usec))<<'\n';

    // Returning the sum keeps the compiler from discarding the loops.
    return ret;
}

The result shows that wyrand is faster than romutrio (in billions of numbers generated per second; higher is better):
wyrand 1.31389
romutrio 1.12729

Hello WangYi!
And thanks for wyrand, it is in my top favorite designs!

It is certainly possible that different benchmarks yield different results.
In particular, the benchmark used here relies on functions that fill buffers, while Lemire’s benchmark relies on functions that return uint32_t or uint64_t.
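
To illustrate the difference between the two interfaces, here is a minimal sketch using wyrand as posted above. `WyrandState`, `wyrand_next` and `wyrand_fill` are hypothetical names chosen for this example; they are not the functions used in either benchmark.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical state wrapper, so the sketch avoids a global seed.
struct WyrandState { uint64_t seed; };

// Value-returning style: one call yields one 64-bit number.
inline uint64_t wyrand_next(WyrandState& s) {
    s.seed += 0xa0761d6478bd642full;
    __uint128_t t = (__uint128_t)(s.seed ^ 0xe7037ed1a0b428dbull) * s.seed;
    return (uint64_t)(t >> 64) ^ (uint64_t)t;
}

// Buffer-filling style: one call writes `size` bytes into `buf`.
void wyrand_fill(WyrandState& s, uint8_t* buf, size_t size) {
    size_t i = 0;
    // Write whole 8-byte words first.
    for (; i + 8 <= size; i += 8) {
        uint64_t v = wyrand_next(s);
        std::memcpy(buf + i, &v, 8);
    }
    // Then write the leading bytes of one more word for any remainder.
    if (i < size) {
        uint64_t v = wyrand_next(s);
        std::memcpy(buf + i, &v, size - i);
    }
}
```

A benchmark of the first style pays for a call (or relies on inlining) per number, while the second lets the generator amortize work across a whole buffer, so the two can rank generators differently.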

I made the benchmark easy for anyone to run on the same type of machine, so that everyone gets the same results:

shishua/Makefile

Lines 141 to 151 in 4bd8c5c

benchmark-intel: /usr/bin/gcloud
	gcloud compute instances create shishua-intel \
	  --machine-type=n2-standard-2 \
	  --image-project=ubuntu-os-cloud --image-family=ubuntu-1910 \
	  --zone=us-central1-f \
	  --maintenance-policy=TERMINATE
	tar cJf shishua.tar.xz $$(git ls-files)
	gcloud compute scp ./shishua.tar.xz shishua-intel:~ --ssh-key-file=$(SSH_KEY)
	rm shishua.tar.xz
	gcloud compute ssh shishua-intel --ssh-key-file=$(SSH_KEY) -- 'tar xJf shishua.tar.xz && ./gcp-perf.sh'
	gcloud compute instances delete shishua-intel

The results seem fairly consistent; the figures published in the README were identical across runs on different days and at different hours.

But I’d like to investigate wyrand’s performance. If you have insights, they are welcome!
In particular, you mention that wyrand is still the fastest without AVX. Do you think the order would be reversed if AVX were removed? Do you mean at the compiler level or at the CPU level?
If so, I could add a note to the README.
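
For readers unsure about the compiler-level versus CPU-level distinction, here is a minimal sketch, assuming GCC or Clang on x86. The helper names `compiled_with_avx2` and `cpu_supports_avx2` are hypothetical; `__AVX2__` and `__builtin_cpu_supports` are the compiler's own macro and builtin.

```cpp
// Compiler level: true only if this file was built with AVX2 code generation
// enabled (for example with -mavx2, or -march=native on an AVX2 machine).
constexpr bool compiled_with_avx2() {
#if defined(__AVX2__)
    return true;
#else
    return false;
#endif
}

// CPU level: asks the processor at run time, whatever flags were used to build.
bool cpu_supports_avx2() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;  // not detected on other platforms in this sketch
#endif
}
```

A binary built without AVX2 flags will not use AVX2 instructions for compiler-generated code even on a CPU that supports them, which is why the two levels can give different benchmark rankings.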