daqana / dqrng

Fast Pseudo Random Number Generators for R

Home Page: https://daqana.github.io/dqrng/

Enable parallel usage of templates used for dqsample

isezen opened this issue · comments

So, is it thread safe? Can we use, for instance, dqrunif in an RcppParallel::Worker? Or should we follow the examples at https://cran.r-project.org/web/packages/dqrng/vignettes/parallel.html?

The functions provided in the C++ API are not thread safe since they are provided via R's C API, e.g.:

inline Rcpp::NumericVector dqrunif(size_t n, double min = 0.0, double max = 1.0) {
  typedef SEXP(*Ptr_dqrunif)(SEXP,SEXP,SEXP);
  static Ptr_dqrunif p_dqrunif = NULL;
  if (p_dqrunif == NULL) {
    validateSignature("Rcpp::NumericVector(*dqrunif)(size_t,double,double)");
    p_dqrunif = (Ptr_dqrunif)R_GetCCallable("dqrng", "_dqrng_dqrunif");
  }
  RObject rcpp_result_gen;
  {
    rcpp_result_gen = p_dqrunif(Shield<SEXP>(Rcpp::wrap(n)), Shield<SEXP>(Rcpp::wrap(min)), Shield<SEXP>(Rcpp::wrap(max)));
  }
  if (rcpp_result_gen.inherits("interrupted-error"))
    throw Rcpp::internal::InterruptedException();
  if (Rcpp::internal::isLongjumpSentinel(rcpp_result_gen))
    throw Rcpp::LongjumpException(rcpp_result_gen);
  if (rcpp_result_gen.inherits("try-error"))
    throw Rcpp::exception(Rcpp::as<std::string>(rcpp_result_gen).c_str());
  return Rcpp::as<Rcpp::NumericVector >(rcpp_result_gen);
}

Please follow the examples from the parallel vignette instead. Do you have any questions w.r.t. these examples?

Thank you very much for asking. Then,

double dqrng::runif(double min = 0.0, double max = 1.0);
double dqrng::rnorm(double mean = 0.0, double sd = 1.0);

functions are also not thread safe (a pity). I need thread-safe equivalents of the rnorm, sample and runif functions to create a random forest with RcppParallel. Actually, my implementation is very similar to [1]. Can you give an example of how to create and use rnorm, sample and runif functions in an RcppParallel::Worker? This would be very handy.

[1] https://www.daqana.org/dqrng/articles/parallel.html#pcg-multiple-streams-with-rcppparallel

Correct, dqrng::runif and dqrng::rnorm are not thread safe either. However, thread-safe equivalents should be fairly simple to achieve: given the existing code in the RcppParallel example, dist(rng) will give you a normally distributed random variate. You can define an additional uniform distribution in the same way, cf. the last example in the default vignette, which actually originated in a parallel application (cf. #13 and the SO question referenced there).
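The idea can be sketched in standard C++ alone. This is a minimal sketch under stated assumptions, not dqrng's API: std::mt19937_64 seeded through std::seed_seq stands in for `pcg64 rng(seed, stream)`, and the names make_engine and local_rng are hypothetical. The point it illustrates is that each worker owns its engine and its distribution objects, so no generator state is shared between threads.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical stand-in for `pcg64 rng(seed, stream)`: mixes the seed and the
// stream id into the initial state of a standard-library engine.
inline std::mt19937_64 make_engine(std::uint64_t seed, std::uint64_t stream) {
  std::seed_seq seq{static_cast<std::uint32_t>(seed),
                    static_cast<std::uint32_t>(seed >> 32),
                    static_cast<std::uint32_t>(stream),
                    static_cast<std::uint32_t>(stream >> 32)};
  return std::mt19937_64(seq);
}

// Hypothetical per-worker bundle: construct one inside operator() so that
// neither the engine nor the distributions are shared between threads.
struct local_rng {
  std::mt19937_64 engine;
  std::normal_distribution<double> norm{0.0, 1.0};
  std::uniform_real_distribution<double> unif{0.0, 1.0};

  local_rng(std::uint64_t seed, std::uint64_t stream)
      : engine(make_engine(seed, stream)) {}

  // Normal variate with the given mean and standard deviation.
  double rnorm(double mean = 0.0, double sd = 1.0) {
    assert(sd >= 0.0);
    return mean + sd * norm(engine);
  }

  // Uniform variate between min and max.
  double runif(double min = 0.0, double max = 1.0) {
    assert(min <= max);
    return min + (max - min) * unif(engine);
  }
};
```

Inside an RcppParallel worker, the equivalent would presumably be to construct the PCG generator and both distributions at the top of operator() and call them in the loop over the chunk.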

A parallel alternative to sample is more difficult, though, since the current implementation of dqsample is only available in the compiled library. I am changing the title of this issue to reflect this (reasonable) enhancement request.

I have looked into the examples that you suggested. For now, I will stick to the serial dqrng methods for testing purposes until the enhancement is available, because the other methods won't help me much without a thread-safe sample method.
Thank you very much for the package. The community really needs thread-safe statistical methods to make things faster.

Hello Ralf,
I couldn't help myself and tried to implement parallel sampling with the help of your code. The code below is a modified version of "PCG: multiple streams with RcppParallel" and fills each column with randomly shuffled row indices. It seems to work smoothly. Perhaps you could give me some advice on how to improve the code?

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppParallel)]]
// [[Rcpp::depends(dqrng, BH, sitmo)]]
#include <Rcpp.h>
#include <RcppParallel.h>
#include <R_randgen.h>
#include <convert_seed.h>
#include <dqrng_generator.h>
#include <numeric> // std::iota
#include <vector>
using namespace Rcpp;

struct par_sample : public RcppParallel::Worker {
  RcppParallel::RMatrix<double> output;
  uint64_t seed;

  par_sample(Rcpp::NumericMatrix output, const uint64_t seed)
    : output(output), seed(seed) {};

  std::vector<uint32_t> no_replacement_shuffle(pcg64 &rng, uint32_t m, uint32_t n) {
    std::vector<uint32_t> tmp(m);
    std::iota(tmp.begin(), tmp.end(), static_cast<uint32_t>(0));
    for (uint32_t i = 0; i < n; ++i) std::swap(tmp[i], tmp[i + (rng)(m - i)]);
    if (m == n) return(tmp);
    return(std::vector<uint32_t>(tmp.begin(), tmp.begin() + n));
  }

  void operator()(std::size_t begin, std::size_t end) {
    int n = output.nrow();
    pcg64 rng(seed, end);
    for (std::size_t col = begin; col < end; ++col) {
      RcppParallel::RMatrix<double>::Column column = output.column(col);
      std::vector<uint32_t> v = no_replacement_shuffle(rng, n, n);
      for (std::size_t i = 0; i < n; i++) column[i] = v[i];
    }
  }
};

// [[Rcpp::export]]
Rcpp::NumericMatrix par_samp_mat(const int n, const int m) {
  // m = n_columns, n = n_rows
  Rcpp::NumericMatrix res(n, m);
  Rcpp::RNGScope rngScope;
  Rcpp::IntegerVector seed(2, dqrng::R_random_int);
  uint64_t s = dqrng::convert_seed<uint64_t>(seed);
  par_sample ps(res, s);
  RcppParallel::parallelFor(0, m, ps);
  return res;
}

/*** R
nc <- 10; nr <- 10;
set.seed(42)
a <- par_samp_mat(nr, nc)
b <- par_samp_mat(nr, nc)
all(apply(a, 2, function(x) all.equal(x, unique(x)))) # must be TRUE
all(colSums(a) == sum(0:(nr - 1))) # must be TRUE
all(a == b) # must be FALSE
*/

Do you need to shuffle or sample the data? The function no_replacement_shuffle supports sampling by partial shuffling, but you currently only use full shuffling. If full shuffling is all you need, you probably can simplify things a bit.

Independent of that:

  1. You don't need an explicit Rcpp::RNGScope if your function uses [[Rcpp::export]].

  2. The function no_replacement_shuffle does not need to be a member function.

  3. PCG's operator for drawing an integer in a range is nice, but AFAIK the nearly division-less algorithm in dqrng::random_64bit_wrapper is faster (and can be combined with the faster Xo(ro)shiro RNGs). You could use something like:

     dqrng::rng64_t rng = std::make_shared<dqrng::random_64bit_wrapper<pcg64>>();
     rng->seed(seed, end);
    

    together with no_replacement_shuffle(dqrng::rng64_t &rng, uint32_t m, uint32_t n). I probably should provide a suitable overload for generator(seed, stream).
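For readers unfamiliar with the term, a "nearly division-less" bounded draw is usually understood to mean Lemire's method. The sketch below shows that method for a 32-bit bound. It is an illustration only: the function name bounded_uint32 is hypothetical, std::mt19937 is assumed as the source of 32 random bits, and whether dqrng's wrapper matches this line by line is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <random>

// Lemire's nearly division-less method: unbiased integer in [0, range).
// rng() must return 32 uniformly distributed bits (e.g. std::mt19937).
// The modulo is only evaluated on the rare path where a rejection is possible.
template <typename RNG32>
inline std::uint32_t bounded_uint32(RNG32 &rng, std::uint32_t range) {
  assert(range > 0);
  std::uint64_t m =
      static_cast<std::uint64_t>(static_cast<std::uint32_t>(rng())) * range;
  std::uint32_t low = static_cast<std::uint32_t>(m);
  if (low < range) {
    // threshold = 2^32 mod range, computed without leaving 32 bits
    const std::uint32_t threshold =
        (std::numeric_limits<std::uint32_t>::max() - range + 1) % range;
    while (low < threshold) {
      m = static_cast<std::uint64_t>(static_cast<std::uint32_t>(rng())) * range;
      low = static_cast<std::uint32_t>(m);
    }
  }
  return static_cast<std::uint32_t>(m >> 32); // high word is the result
}
```

In the shuffle loop, a call such as bounded_uint32(gen, m - i) would play the role of (*rng)(m - i).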

Thanks Ralf, I modified the example according to your suggestions. I need to sample data by partial shuffling without replacement. As this is only an example, for simplicity I select n elements out of m = n in the parallel worker. Thank you also for the 3rd suggestion. So, can I use other random generators in a parallel worker? Forgive my ignorance, but which generator is more random or more suitable for parallel computing? Or should I use only pcg64 for parallel computing?

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppParallel)]]
// [[Rcpp::depends(dqrng, BH, sitmo)]]
#include <Rcpp.h>
#include <RcppParallel.h>
#include <R_randgen.h>
#include <convert_seed.h>
#include <dqrng_generator.h>
#include <numeric> // std::iota
#include <vector>
using namespace Rcpp;

std::vector<uint32_t> no_replacement_shuffle(dqrng::rng64_t &rng, uint32_t m,
                                             uint32_t n) {
  std::vector<uint32_t> tmp(m);
  std::iota(tmp.begin(), tmp.end(), static_cast<uint32_t>(0));
  for (uint32_t i = 0; i < n; ++i) std::swap(tmp[i], tmp[i + (*rng)(m - i)]);
  if (m == n) return(tmp);
  return(std::vector<uint32_t>(tmp.begin(), tmp.begin() + n));
}

struct par_sample : public RcppParallel::Worker {
  uint64_t seed;
  RcppParallel::RMatrix<double> output;

  par_sample(Rcpp::NumericMatrix output, const uint64_t seed)
    : seed(seed), output(output) {};

  void operator()(std::size_t begin, std::size_t end) {
    dqrng::rng64_t rng = std::make_shared<dqrng::random_64bit_wrapper<pcg64>>();
    rng->seed(seed, end);
    int n = output.nrow();
    for (std::size_t col = begin; col < end; ++col) {
      RcppParallel::RMatrix<double>::Column column = output.column(col);
      std::vector<uint32_t> v = no_replacement_shuffle(rng, n, n);
      for (std::size_t i = 0; i < n; i++) column[i] = v[i];
    }
  }
};

// [[Rcpp::export]]
Rcpp::NumericMatrix par_samp_mat(const int n, const int m) {
  // m = n_columns, n = n_rows
  Rcpp::NumericMatrix res(n, m);
  Rcpp::IntegerVector seed(2, dqrng::R_random_int);
  uint64_t s = dqrng::convert_seed<uint64_t>(seed);
  par_sample ps(res, s);
  RcppParallel::parallelFor(0, m, ps);
  return res;
}

/*** R
nc <- 10; nr <- 10;
set.seed(42)
a <- par_samp_mat(nr, nc)
b <- par_samp_mat(nr, nc)
all(apply(a, 2, function(x) all.equal(x, unique(x)))) # must be TRUE
all(colSums(a) == sum(0:(nr - 1))) # must be TRUE
all(a == b) # must be FALSE
*/

If you need to sample, then partial shuffling is performant if you select more than 50% of the population. For smaller selection ratios I found rejection sampling to be more performant. The difference is measurable even for small population sizes, but might be negligible.
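The trade-off can be sketched in standard C++. This is a minimal sketch, assuming std::mt19937_64 and std::uniform_int_distribution in place of the dqrng generators; the function names are hypothetical and the 50% switch point is taken from the rule of thumb above rather than from measurements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <random>
#include <unordered_set>
#include <vector>

// Partial Fisher-Yates shuffle: allocates all m candidates, fixes the first n.
inline std::vector<std::uint32_t>
sample_shuffle(std::mt19937_64 &rng, std::uint32_t m, std::uint32_t n) {
  assert(n <= m);
  std::vector<std::uint32_t> tmp(m);
  std::iota(tmp.begin(), tmp.end(), 0u);
  for (std::uint32_t i = 0; i < n; ++i) {
    std::uniform_int_distribution<std::uint32_t> pick(i, m - 1);
    std::swap(tmp[i], tmp[pick(rng)]);
  }
  tmp.resize(n);
  return tmp;
}

// Rejection sampling: draw from [0, m) and retry on values already seen.
inline std::vector<std::uint32_t>
sample_reject(std::mt19937_64 &rng, std::uint32_t m, std::uint32_t n) {
  assert(n <= m);
  std::vector<std::uint32_t> result;
  result.reserve(n);
  if (n == 0) return result;
  std::unordered_set<std::uint32_t> seen;
  std::uniform_int_distribution<std::uint32_t> pick(0, m - 1);
  while (result.size() < n) {
    const std::uint32_t v = pick(rng);
    if (seen.insert(v).second) result.push_back(v);
  }
  return result;
}

// Shuffle when more than half of the population is selected, else reject.
inline std::vector<std::uint32_t>
sample_no_replace(std::mt19937_64 &rng, std::uint32_t m, std::uint32_t n) {
  return (m < 2 * static_cast<std::uint64_t>(n)) ? sample_shuffle(rng, m, n)
                                                 : sample_reject(rng, m, n);
}
```

Both helpers return a std::vector and touch no R data structures, so they could be called from a worker thread.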

As for parallel RNGs: dqrng::random_64bit_wrapper<RNG>::seed(seed, stream) is supported for the following RNGs, which are all suitable for parallel computations:

  • dqrng::xoroshiro128plus and dqrng::xoshiro256plus
  • pcg64
  • sitmo::threefry_20_64
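One design choice when assigning streams: the examples in this thread use the end of the chunk as the stream id, so the output depends on how the range is split across threads. A possible alternative is one stream per column. The sketch below illustrates this in standard C++ only; std::mt19937_64 with std::seed_seq is an assumed stand-in for rng->seed(seed, stream), and stream_engine and fill_columns are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in for rng->seed(seed, stream).
inline std::mt19937_64 stream_engine(std::uint64_t seed, std::uint64_t stream) {
  std::seed_seq seq{static_cast<std::uint32_t>(seed),
                    static_cast<std::uint32_t>(seed >> 32),
                    static_cast<std::uint32_t>(stream),
                    static_cast<std::uint32_t>(stream >> 32)};
  return std::mt19937_64(seq);
}

// Fills columns [begin, end) of a column-major matrix with uniform variates.
// The stream id is the column index, so the values in a column do not depend
// on which chunk (and therefore which thread) processes it.
inline void fill_columns(std::vector<double> &mat, std::size_t n_rows,
                         std::uint64_t seed, std::size_t begin,
                         std::size_t end) {
  assert(begin <= end && end * n_rows <= mat.size());
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  for (std::size_t col = begin; col < end; ++col) {
    std::mt19937_64 rng = stream_engine(seed, col);
    for (std::size_t row = 0; row < n_rows; ++row)
      mat[col * n_rows + row] = unif(rng);
  }
}
```

Re-seeding per column costs more than seeding once per chunk, so this only pays off when reproducibility across thread counts matters.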

FWIW, I would also be very interested in a thread-safe version or equivalent of dqrng::dqsample_int.

In case anyone else needs this, here is the sample code without a global rng, with @isezen's example code re-implemented and making the kind of the RNG an option. Note:

  • Anyone adding this code directly to a package will need to add LinkingTo: sitmo, BH to the DESCRIPTION.
  • Order of import is really important. The RcppParallel include must go after the dqrng ones or it won't compile (seems to be a problem with an In or Out type being defined).
  • If you want to use std::vector<uint32_t> and the like directly instead of Rcpp vectors, some extra templating is required. Edited to add: I neglected to add this extra templating when I originally posted this, which guaranteed that the R session would terminate if you tried to run this code multi-threaded. The par_samp_mat example is now fixed in the most inelegant way possible. But you get the general idea.

@rstub, would you be interested in a PR around this? It's basically just a copy-and-paste with rng made a parameter, but it seems like there would be further higher-level reorganization required to allow both the global and non-global rng, so it would need a bit of discussion.

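#include <Rcpp.h>
#include <algorithm> // std::generate, std::swap
#include <cctype>    // std::tolower
#include <numeric>   // std::iota
#include <string>
#include <vector>

// [[Rcpp::depends(dqrng, BH, sitmo)]]
#include "R_randgen.h"
#include "convert_seed.h"
#include "dqrng_generator.h"
#include "minimal_int_set.h"

// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>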

inline std::vector<uint32_t>
no_replacement_shuffle_vec(dqrng::rng64_t &rng, uint32_t m, uint32_t n) {
  std::vector<uint32_t> tmp(m);
  std::iota(tmp.begin(), tmp.end(), static_cast<uint32_t>(0));
  for (uint32_t i = 0; i < n; ++i)
    std::swap(tmp[i], tmp[i + (*rng)(m - i)]);
  if (m == n)
    return (tmp);
  return (std::vector<uint32_t>(tmp.begin(), tmp.begin() + n));
}


template <int RTYPE, typename INT>
inline Rcpp::Vector<RTYPE> replacement(dqrng::rng64_t &rng, INT m, INT n,
                                       int offset) {
  using storage_t = typename Rcpp::traits::storage_type<RTYPE>::type;
  Rcpp::Vector<RTYPE> result(Rcpp::no_init(n));
  std::generate(result.begin(), result.end(), [=, &rng]() {
    return static_cast<storage_t>(offset + (*rng)(m));
  });
  return result;
}

template <int RTYPE, typename INT>
inline Rcpp::Vector<RTYPE> no_replacement_shuffle(dqrng::rng64_t &rng, INT m,
                                                  INT n, int offset = 0) {
  using storage_t = typename Rcpp::traits::storage_type<RTYPE>::type;
  Rcpp::Vector<RTYPE> tmp(Rcpp::no_init(m));
  std::iota(tmp.begin(), tmp.end(), static_cast<storage_t>(offset));
  for (INT i = 0; i < n; ++i) {
    std::swap(tmp[i], tmp[i + (*rng)(m - i)]);
  }
  if (m == n)
    return tmp;
  else
    return Rcpp::Vector<RTYPE>(tmp.begin(), tmp.begin() + n);
}

template <int RTYPE, typename INT, typename SET>
inline Rcpp::Vector<RTYPE> no_replacement_set(dqrng::rng64_t &rng, INT m, INT n,
                                              int offset) {
  using storage_t = typename Rcpp::traits::storage_type<RTYPE>::type;
  Rcpp::Vector<RTYPE> result(Rcpp::no_init(n));
  SET elems(m, n);
  for (INT i = 0; i < n; ++i) {
    INT v = (*rng)(m);
    while (!elems.insert(v)) {
      v = (*rng)(m);
    }
    result(i) = static_cast<storage_t>(offset + v);
  }
  return result;
}

template <int RTYPE, typename INT>
inline Rcpp::Vector<RTYPE> sample(dqrng::rng64_t &rng, INT m, INT n,
                                  bool replace = false, int offset = 0) {
  if (replace || n <= 1) {
    return replacement<RTYPE, INT>(rng, m, n, offset);
  } else {
    if (!(m >= n))
      Rcpp::stop("Argument requirements not fulfilled: m >= n");
    if (m < 2 * n) {
      return no_replacement_shuffle<RTYPE, INT>(rng, m, n, offset);
    } else if (m < 1000 * n) {
      return no_replacement_set<RTYPE, INT, dqrng::minimal_bit_set>(rng, m, n,
                                                                    offset);
    } else {
      return no_replacement_set<RTYPE, INT, dqrng::minimal_hash_set<INT>>(
          rng, m, n, offset);
    }
  }
}

template <typename RNGtype> struct par_sample : public RcppParallel::Worker {
  uint64_t seed;
  RcppParallel::RMatrix<double> output;

  par_sample(Rcpp::NumericMatrix output, const uint64_t seed)
      : seed(seed), output(output){};

  void operator()(std::size_t begin, std::size_t end) {
    dqrng::rng64_t rng =
        std::make_shared<dqrng::random_64bit_wrapper<RNGtype>>();
    rng->seed(seed, end);
    auto n = static_cast<std::size_t>(output.nrow());
    for (std::size_t col = begin; col < end; ++col) {
      auto column = output.column(col);
      auto v = no_replacement_shuffle_vec(rng, n, n);
      for (std::size_t i = 0; i < n; i++)
        column[i] = v[i];
    }
  }
};

// [[Rcpp::export]]
Rcpp::NumericMatrix par_samp_mat(const int n, const int m,
                                 std::string kind = "pcg64") {
  // m = n_columns, n = n_rows
  Rcpp::NumericMatrix res(n, m);
  Rcpp::IntegerVector seed(2, dqrng::R_random_int);
  auto s = dqrng::convert_seed<uint64_t>(seed);

  for (auto &c : kind) {
    c = std::tolower(c);
  }
  if (kind == "xoroshiro128+") {
    par_sample<dqrng::xoroshiro128plus> ps(res, s);
    RcppParallel::parallelFor(0, m, ps);
  } else if (kind == "xoshiro256+") {
    par_sample<dqrng::xoshiro256plus> ps(res, s);
    RcppParallel::parallelFor(0, m, ps);
  } else if (kind == "pcg64") {
    par_sample<pcg64> ps(res, s);
    RcppParallel::parallelFor(0, m, ps);
  } else if (kind == "threefry") {
    par_sample<sitmo::threefry_20_64> ps(res, s);
    RcppParallel::parallelFor(0, m, ps);
  } else {
    Rcpp::stop("Unknown parallel random generator kind: %s", kind);
  }

  return res;
}

Thank God I'm not the only one with nothing better to do today...

If the proposal is to include the RcppParallel code in dqrng header libraries, it would be good to add some macros so that downstream packages aren't obliged to link to RcppParallel if they don't need to use this functionality. (The same effect could be achieved by just having this parallel code in a separate file, but it's good to be explicit about this, especially as source code gets rearranged over time.)

If the proposal is to include the RcppParallel code in dqrng itself... as we've found out, this opens the door to various installation difficulties that hit uwot when people have messages in their .Rprofile. I can already foresee bug reports coming my way given the dependence of many of my packages on dqrng. This isn't a big deal if there are clear benefits... but does anyone really need an R interface to parallel sampling? Seems like most people here are using this in C++ anyway.

Happy Thanksgiving @LTLA! The proposal would only be to provide versions of sample that take rng as a parameter. The RcppParallel stuff is just an example of why you would want those versions of sample.

Thanks @jlmelville! I have not looked at your code in detail, but I am open to a PR. Note that I would prefer to have only one version of the sampling code, i.e. dqrng::dqsample_int should then use the version with rng as a parameter.

@isezen
In #47 I have also moved the sampling code to a separate header file. I have not tried using this in a parallel context yet, though.

@isezen
See the section "PCG: multiple streams with RcppParallel" of the parallel vignette for an example of parallel sampling.

Actually this does not work reliably since Rcpp::Vector is used as return type, i.e. a data structure which is controlled by R.

fixed in #70