daqana / dqrng

Fast Pseudo Random Number Generators for R

Home Page:https://daqana.github.io/dqrng/

How to use dqrng to make RcppParallel code reproducible

heatherjzhou opened this issue

Hi,

Thank you for making this tool. It is one of the only available options that could make my RcppParallel code reproducible.

The only instruction I can find on how to use dqrng to make RcppParallel reproducible is the last section of this: https://www.daqana.org/dqrng/articles/parallel.html. This example uses auto, std::ref(), and std::generate(). However, in the program I am writing, the operator() in RcppParallel::Worker calls another function that I wrote (which uses RcppArmadillo) in each iteration. This function is quite complicated and calls other helper functions. In some of these helper functions, I perform random sampling; in other helper functions, I make draws from a normal distribution.

My question is: how do I use dqrng to make RcppParallel reproducible without using std::ref() and std::generate(), when I need to draw from different distributions in different helper functions? Specifically, what is the syntax for random sampling and for drawing from a normal distribution, since the syntax from the other instructions doesn't seem to apply? Also, I want to give the user of my program the option to set a seed, meaning that they don't have to provide one. How do I incorporate that option when I'm using dqrng?

Sorry if my question seems obvious to you or if I wasn't clear. Thank you in advance for your help!

Currently the sampling code in dqrng cannot be used in parallel, cf. #26 (@jlmelville are you still interested in a PR?). As for the general design: it would be possible to create a thread-local RNG and pass it by reference to your helper functions, which can then draw from whatever distribution they need. I will think about some simplified example.
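The design described here (one generator per worker, handed by reference to every helper) can be sketched with the C++ standard library alone. This is only an illustration under stated assumptions: std::mt19937_64 and std::seed_seq stand in for dqrng's generators and streams, and the names draw_normal, draw_category and worker_draws are hypothetical, not part of dqrng or RcppParallel.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: draws from N(0, 1) using the caller's generator.
double draw_normal(std::mt19937_64 &rng) {
  std::normal_distribution<double> normal(0.0, 1.0);
  return normal(rng);
}

// Hypothetical helper: draws an integer from {1, ..., k} with equal probabilities.
int draw_category(std::mt19937_64 &rng, int k) {
  std::uniform_int_distribution<int> pick(1, k);
  return pick(rng);
}

// Work done by one worker. The generator is local to the worker and is
// seeded from (seed, worker), so the result does not depend on which
// thread runs the worker or in which order the workers finish.
std::vector<double> worker_draws(std::uint64_t seed, std::uint32_t worker,
                                 std::size_t n) {
  std::seed_seq seq{static_cast<std::uint32_t>(seed),
                    static_cast<std::uint32_t>(seed >> 32), worker};
  std::mt19937_64 rng(seq);

  std::vector<double> out(n);
  for (std::size_t i = 0; i < n; ++i) {
    // Separate statements keep the order of the draws well defined.
    double value = draw_normal(rng);
    value += draw_category(rng, 4);
    out[i] = value;
  }
  return out;
}
```

Calling worker_draws(5, 0, 7) twice yields the same seven values, because the helpers only ever advance the generator they are handed.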

@rstub, I am still interested in a PR, but I was hoping to have fully finished working on my downstream use case for it, so I can be sure it's fit for at least one purpose. Unfortunately that has taken a lot longer than I thought it would. But I haven't forgotten!

@rstub Thanks! Looking forward to it.

It's ok that sampling cannot be used in parallel as long as you can still draw from a uniform distribution, because then I can just write my own function for sampling with replacement.
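As a rough sketch of that idea, sampling with replacement can be built on uniform draws from [0, 1) by scaling each draw to an index. The example uses std::mt19937_64 from the standard library as a stand-in generator, and sample_with_replacement is a hypothetical name, not a dqrng function.

```cpp
#include <cstddef>
#include <random>
#include <stdexcept>
#include <vector>

// Sample `size` elements from `values` with replacement and equal
// probabilities, using only uniform draws from [0, 1).
template <typename T>
std::vector<T> sample_with_replacement(const std::vector<T> &values,
                                       std::size_t size,
                                       std::mt19937_64 &rng) {
  if (values.empty())
    throw std::invalid_argument("cannot sample from an empty vector");

  std::uniform_real_distribution<double> unif(0.0, 1.0);
  std::vector<T> out;
  out.reserve(size);
  for (std::size_t i = 0; i < size; ++i) {
    // Scale u in [0, 1) to an index in [0, values.size()).
    std::size_t idx = static_cast<std::size_t>(unif(rng) * values.size());
    if (idx >= values.size())  // guard against floating-point rounding
      idx = values.size() - 1;
    out.push_back(values[idx]);
  }
  return out;
}
```

With a sample space of {1, 2, 3, 4} this corresponds to sampling from 1:4 with replacement, and two generators constructed from the same seed produce the same sample.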

@rstub Hi, any updates? Thanks!

Coming back to this I realize I do not fully understand the request. What problem do you have with std::ref() and std::generate()? Maybe you can post a (simplified) version of what you have tried so far?

Also, what do you mean by

Also, I want to provide the user of my program the option to set a seed, meaning that they don't have to provide a seed.

What does it mean for users to have the option to set the seed but not having to provide one?

Thanks for following up with me. I will write a simple example to show what I want to achieve this Friday. Thanks!

Thank you for encouraging me to explain myself better. I wrote a simple example (by adapting the last example here: https://cran.r-project.org/web/packages/dqrng/vignettes/parallel.html) to show what I want to make reproducible:

//#include <Rcpp.h>
//[[Rcpp::depends(dqrng,BH,sitmo)]]
#include <pcg_random.hpp>
#include <dqrng_distribution.h>
//[[Rcpp::plugins(cpp11)]]
//[[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>

#include <RcppArmadilloExtensions/sample.h>
using namespace arma;

vec generateVector(const int n){
  mat noiseMatrix=randn(n,2);
  //I put this here because in my actual code, I need to create a matrix
  //where each entry is drawn from the standard normal distribution.

  vec sampleSpace=linspace(1,4,4); //Vector of length 4, the values being 1 through 4
  vec noiseVector=Rcpp::RcppArmadillo::sample(sampleSpace,n,true);
  //I put this here because in my actual code, I need to create a vector
  //where each entry is sampled from the integers 1 to 4 with replacement (equal probabilities).
  //Another reason why I put this here is to show that I need to make draws from multiple different distributions,
  //not just the standard normal distribution.

  vec toReturn=noiseMatrix.col(0)+noiseMatrix.col(1)+noiseVector;
  return toReturn;
}

struct RandomFill : public RcppParallel::Worker {
  RcppParallel::RMatrix<double> output;
  const int n;
  uint64_t seed;

  RandomFill(Rcpp::NumericMatrix output, const uint64_t seed)
    :output(output),n(output.nrow()),seed(seed){};

  void operator()(std::size_t begin, std::size_t end) {
    for (std::size_t colIndex = begin; colIndex < end; ++colIndex) { //Changed col to colIndex
      vec newColumn=generateVector(n); //I use an arma vector here because in my actual code, there is a lot of algebra involved so I use arma data types primarily.
      for (int obsIndex=0;obsIndex<n;obsIndex++){ //Fill the column in output. I'm using a for loop here because I don't know a better way to do it.
        output(obsIndex,colIndex)=newColumn(obsIndex);
      }
    }
  }
};

// [[Rcpp::export]]
Rcpp::NumericMatrix parallel_random_matrix(const int n, const int m, const int seed) { //Changed ncores to seed
  Rcpp::NumericMatrix res(n, m);
  RandomFill randomFill(res,seed);
  RcppParallel::parallelFor(0, m, randomFill);
  return res;
}

/*** R
res1<-parallel_random_matrix(n=7,m=3,seed=5)
res2<-parallel_random_matrix(n=7,m=3,seed=5)
res1==res2
*/

To summarize, I want to be able to do the following when using R and RcppParallel together:

  • I want to fill an arma matrix with entries that are drawn from the standard normal distribution reproducibly.
  • I want to fill an arma vector with entries sampled from 1:4 with replacement (equal probabilities) reproducibly.
  • I want the user of the R function parallel_random_matrix() to be able to supply NA as the value for seed when they don't want to set a seed.

And I'm not sure what the syntax is to achieve these. Also, when running the original example in the link provided at the beginning of this message and when running my example above, I get this error message:

[Image: screenshot of the compiler message, 2020-02-07]

Thanks for your help!

The following is far from ideal. Just some ideas to get you started. Some interesting points:

  • For reproducibility I create an RNG for each thread with the stream defined by end.
  • I need to pass the RNG "by reference" to generateVector. This is done automatically since dqrng::rng64_t is actually a smart pointer.
  • Using dqrng::rng64_t has the advantage that I can use any of the RNGs supported by dqrng.
  • Using dqrng::rng64_t has the advantage that I can generate integers within a range very easily. This is equivalent to sampling from [0,n) with replacement and equal probabilities.
  • It is of course possible to use multiple distribution functions, see example with exponential distribution.
  • .imbue() and .transform() together with lambda functions from C++11 are very convenient.
  • It is better to use NULL for the case where no seed is supplied by the user. In that case I generate a "random" seed from R's RNG.
  • I have not tried to get rid of the for loop or to switch completely to Armadillo data structures. Both should be possible, though.

//[[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
 
//[[Rcpp::depends(dqrng,BH,sitmo)]]
#include <dqrng_distribution.h>
#include <convert_seed.h>
#include <R_randgen.h>
//[[Rcpp::plugins(cpp11)]]
//[[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>


arma::vec generateVector(const int n, dqrng::rng64_t rng){
  arma::mat noiseMatrix(n,2);
  
  // initialize with values from first distribution
  dqrng::normal_distribution normal(0.0, 1.0);
  noiseMatrix.imbue([&]() {return normal(*rng);});
  
  // add values from a second distribution
  dqrng::exponential_distribution exponential(1.0);
  noiseMatrix.transform([&](double val) {return (val + exponential(*rng));});

  // there are special methods for integers within a range 
  arma::vec noiseVector(n);
  noiseVector.imbue([&]() {return (*rng)(uint32_t(4)) + 1;});

  arma::vec toReturn=noiseMatrix.col(0) + noiseMatrix.col(1) + noiseVector;
  return toReturn;
}

struct RandomFill : public RcppParallel::Worker {
  RcppParallel::RMatrix<double> output;
  const int n;
  uint64_t seed;

  RandomFill(Rcpp::NumericMatrix output, const uint64_t seed)
    :output(output),n(output.nrow()),seed(seed){};

  void operator()(std::size_t begin, std::size_t end) {
    dqrng::rng64_t rng = dqrng::generator<dqrng::xoroshiro128plus>(seed);
    rng->seed(seed, end); // I should add a dqrng::generator<RNG>(seed, stream) ...
    
    for (std::size_t colIndex = begin; colIndex < end; ++colIndex) { //Changed col to colIndex
      arma::vec newColumn=generateVector(n, rng); //I use an arma vector here because in my actual code, there is a lot of algebra involved so I use arma data types primarily.
      for (int obsIndex=0;obsIndex<n;obsIndex++){ //Fill the column in output. I'm using a for loop here because I don't know a better way to do it.
        output(obsIndex,colIndex)=newColumn(obsIndex);
      }
    }
  }
};

// [[Rcpp::export]]
Rcpp::NumericMatrix parallel_random_matrix(const int n, const int m,
                                           Rcpp::Nullable<Rcpp::IntegerVector> seed = R_NilValue) {

  // get a seed from R's RNG in case the user did not provide one
  uint64_t _seed;
  if (seed.isNotNull())
    _seed = dqrng::convert_seed<uint64_t>(seed.as());
  else
    _seed = dqrng::convert_seed<uint64_t>(Rcpp::IntegerVector(2, dqrng::R_random_int));
  
  Rcpp::NumericMatrix res(n, m);
  RandomFill randomFill(res, _seed);
  RcppParallel::parallelFor(0, m, randomFill);
  return res;
}

/*** R
res1<-parallel_random_matrix(n=7,m=3,seed=5)
res2<-parallel_random_matrix(n=7,m=3,seed=5)
res1==res2
*/

Does this help?

BTW, I still don't understand what your problem with std::ref and std::generate is.

Concerning the compilation message: This is not an error (not even a warning) but a note. It has been fixed in more recent versions of boost, which I think are also available via the BH package.
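One caveat about deriving the stream from end, as the worker above does: the values in a column then depend on where the chunk boundaries fall, so a different split of the columns (for example with a different number of threads or grain size) gives a different matrix for the same seed. A possible alternative, sketched here with the standard library only, is to derive the generator from the column index. std::mt19937_64 and std::seed_seq stand in for dqrng's generators and streams, and fill_columns is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Fill columns [begin, end) of a column-major matrix with n rows.
// Each column gets its own generator, derived from (seed, column index),
// so the content of a column does not depend on the chunk it belongs to.
void fill_columns(std::vector<double> &out, std::size_t n,
                  std::size_t begin, std::size_t end, std::uint64_t seed) {
  for (std::size_t col = begin; col < end; ++col) {
    const std::uint64_t c = col;
    std::seed_seq seq{static_cast<std::uint32_t>(seed),
                      static_cast<std::uint32_t>(seed >> 32),
                      static_cast<std::uint32_t>(c),
                      static_cast<std::uint32_t>(c >> 32)};
    std::mt19937_64 rng(seq);
    // A fresh distribution per column avoids carrying cached values
    // from one column into the next.
    std::normal_distribution<double> normal(0.0, 1.0);
    for (std::size_t row = 0; row < n; ++row)
      out[col * n + row] = normal(rng);
  }
}
```

Filling columns 0 to 5 in one call, or in several calls over arbitrary sub-ranges in any order, gives the same matrix. The price is constructing one generator per column, which is more expensive than one per chunk.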

Thank you for your explanation!

I don't have a problem with std::ref or std::generate per se. I just don't know how to use them to achieve what I want because I don't know the syntax. In fact, if the syntax is simpler than using .imbue(), I'd much prefer that. If you don't mind showing me how to write generateVector() with std::ref and std::generate instead of .imbue(), I'd really appreciate it.

Lastly, I see that Rcpp::IntegerVector(2, dqrng::R_random_int) returns a vector of length 2. Is that necessary or can we just generate one integer? If we can just generate one integer, what is the syntax for using dqrng::R_random_int to create one variable of type uint64_t?

Thanks again!

Actually std::generate would be quite similar to .imbue(), since both try to avoid naked for loops. IMHO it would be a good idea to learn about such methods, starting with the algorithm header from the STL. Anyway, here is a version of generateVector() that uses for loops instead:

arma::vec generateVector(const int n, dqrng::rng64_t rng){
    arma::mat noiseMatrix(n,2);
    
    // initialize with values from normal distribution
    dqrng::normal_distribution normal(0.0, 1.0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < 2; ++j)
            noiseMatrix(i, j) = normal(*rng);
    
    // there are special methods for integers within a range 
    arma::vec noiseVector(n);
    for (int i = 0; i < n; ++i)
        noiseVector(i) = (*rng)(uint32_t(4)) + 1;
    
    arma::vec toReturn=noiseMatrix.col(0) + noiseMatrix.col(1) + noiseVector;
    return toReturn;
}
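For comparison, a version built on std::generate and std::ref might look roughly like the following. It is a sketch using only the standard library: std::vector replaces the Armadillo types, std::mt19937_64 replaces the dqrng generator, and generate_vector is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

std::vector<double> generate_vector(std::size_t n, std::mt19937_64 &rng) {
  std::normal_distribution<double> normal(0.0, 1.0);
  std::uniform_int_distribution<int> pick(1, 4);

  // std::bind copies its arguments. Wrapping the generator in std::ref
  // makes the bound callable advance the caller's generator instead of
  // a private copy of it.
  auto draw_normal = std::bind(normal, std::ref(rng));
  auto draw_pick = std::bind(pick, std::ref(rng));

  // Two normal draws per entry, stored as two blocks of length n.
  std::vector<double> noise(2 * n);
  std::generate(noise.begin(), noise.end(), draw_normal);

  // One integer from {1, ..., 4} per entry.
  std::vector<int> category(n);
  std::generate(category.begin(), category.end(), draw_pick);

  std::vector<double> result(n);
  for (std::size_t i = 0; i < n; ++i)
    result[i] = noise[i] + noise[n + i] + category[i];
  return result;
}
```

Without std::ref, std::bind would store a copy of the generator, and every call to generate_vector with the same generator object would return the same numbers.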

As for the seed: I take two draws from R's RNG since a single draw gives only 32 bits of randomness, while the RNGs want 64 bits (at least). If you are ok with not using the full space of possible seeds, you can also use:

    // get a seed from R's RNG in case the user did not provide one
    uint64_t _seed;
    if (seed.isNotNull())
        _seed = dqrng::convert_seed<uint64_t>(seed.as());
    else
        _seed = dqrng::R_random_u32();

Thank you very much for your explanation! I think I understand a lot better now.