vertica / ddR

Standard API for Distributed Data Structures in R

Fork backend potential issues - external pointers

dselivanov opened this issue · comments

Looking at fork_driver.R, we can see that dmapply is essentially mcmapply. So we rely on the fact that every object can be lazily copied from the master to the worker processes. But we are missing the fact that elements of a dlist can be objects which keep data outside R's heap, behind external pointers.

I think the child processes still share that memory with the parent process.
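That sharing is easy to confirm for fork-based workers: a child created by mclapply can still dereference an external pointer allocated by the parent, because fork gives it a copy-on-write view of the parent's address space. A minimal sketch (the make_ptr/read_ptr helpers are hypothetical, defined inline for illustration):

```r
library(Rcpp)
library(parallel)

# heap allocation outside R's heap, wrapped in an external pointer
cppFunction("SEXP make_ptr() { return XPtr<int>(new int(42), true); }")
cppFunction("int read_ptr(SEXP p) { XPtr<int> x(p); return *x; }")

p <- make_ptr()
# forked children inherit the parent's address space, so the pointer
# they see is still valid -- nothing is serialized on the way in
unlist(mclapply(1:2, function(i) read_ptr(p), mc.cores = 2))
# [1] 42 42   (on a POSIX system where mclapply actually forks)
```

The problem only appears once the pointer has to cross a serialization boundary, as the example below shows.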

Actually, at the moment things are more complicated/weird even for the parallel fork driver:

library(ddR)
library(Rcpp)
library(parallel)

cppFunction(
"
SEXP init_std_vector(IntegerVector x) {
  std::vector<int> *xx = new std::vector<int>(x.size());
  for(int i = 0; i < x.size(); i++)
    xx->at(i) = x[i];
  XPtr< std::vector<int> > ptr(xx, true);
  return ptr;
}
")

cppFunction(
"  
IntegerVector get_std_vector(SEXP ptr) {
  XPtr< std::vector<int> > vec(ptr);
  return wrap(*vec);
}
"
)
ddr = useBackend("parallel", 2, type = "FORK")
v1 = dmapply(function(x) init_std_vector(1:10), list(1, 2), output.type = "dlist", combine = "c")
v2 = dmapply(function(x) get_std_vector(x), v1, output.type = "dlist", combine = "c")

Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: external pointer is not valid

I think this is because:

collect(v1)

[[1]]
<pointer: 0x0>

[[2]]
<pointer: 0x0>
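The null pointers can be reproduced without ddR at all: an external pointer that goes through a serialize/unserialize round trip (which is how collect and any non-fork transport move data) comes back as a null pointer, because R serializes only the EXTPTRSXP shell, not the C++ memory behind it. A minimal sketch, assuming the init_std_vector/get_std_vector functions defined above:

```r
p <- init_std_vector(1:10)
p                                # a live pointer, e.g. <pointer: 0x55d3...>
p2 <- unserialize(serialize(p, NULL))
p2                               # <pointer: 0x0> -- the std::vector was not serialized
get_std_vector(p2)               # Error: external pointer is not valid
```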

Drive-by comment regarding #25 (comment):

Yes, this is because these on-the-fly-compiled Rcpp objects hold external-pointer references, which makes them non-exportable/non-serializable. FWIW, this is mentioned in https://cran.r-project.org/web/packages/future/vignettes/future-4-non-exportable-objects.html.
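The same issue bites the function objects themselves if you ship a cppFunction/sourceCpp result to a non-fork worker; a hypothetical sketch:

```r
cl <- parallel::makePSOCKcluster(1)
parallel::clusterExport(cl, "get_std_vector")
# The exported closure references compiled code that exists only in the
# master process; calling it on the worker fails, typically with
# "NULL value passed as symbol address":
try(parallel::clusterEvalQ(cl, get_std_vector(NULL)))
parallel::stopCluster(cl)
```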

The solution is to not compile on the fly and instead put the Rcpp code in a package.
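A minimal sketch of that layout (package name hypothetical). Functions exported from an installed package serialize as ordinary references and are re-resolved against the package's shared library on each worker, so no external pointer to compiled code needs to cross the process boundary:

```
myvectors/
├── DESCRIPTION       # Imports: Rcpp; LinkingTo: Rcpp
├── NAMESPACE         # useDynLib(myvectors, .registration = TRUE)
├── R/
│   └── RcppExports.R
└── src/
    ├── RcppExports.cpp
    └── vectors.cpp   # init_std_vector() / get_std_vector() moved here
```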