vertica / ddR

Standard API for Distributed Data Structures in R

DESIGN: will ddR support implicit use of global variables?

clarkfitzg opened this issue · comments

The changes in #15 exposed this bug: global variables are not exported to a PSOCK cluster, which causes the kmeans example to fail. A minimal example:

library(ddR)

globvar <- 100
f <- function(x) globvar

useBackend(parallel, type="FORK")

# Works fine: forked workers inherit a copy of the
# master process's memory, so globvar is visible
dl <- dlist(1:10, letters)
collect(dlapply(dl, f))

useBackend(parallel, type="PSOCK")

# Fails to find globvar: PSOCK workers are fresh R
# sessions, and nothing exports globvar to them
dl <- dlist(1:10, letters)
collect(dlapply(dl, f))

So I think we need to decide how ddR should handle this, for the sake of portability. Here are the options as I see them:

  1. Only allow pure functions. This is the simplest approach.
  2. Add a parameter to pass an environment in which the function will be evaluated. The Spark and parallel backends support this.
  3. Programmatically gather the function's dependencies from its code. SparkR does this.

Right now 2) is the most appealing, because it's clear what's happening. 1) would not be enough; for example, I often compose a large function out of several smaller functions. 3) is appealing, but significantly more complex.
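For concreteness, here is a minimal sketch of the mechanics behind 2) using plain parallel, outside of ddR. The explicit env object and the export step are illustrative, not existing ddR API:

library(parallel)

globvar <- 100
f <- function(x) globvar

cl <- makeCluster(2, type = "PSOCK")

# The caller names an environment holding the function's dependencies,
# and the backend ships those variables to the workers
env <- list2env(list(globvar = globvar), envir = new.env())
clusterExport(cl, ls(env), envir = env)

# The workers can now resolve globvar
parLapply(cl, 1:4, f)

stopCluster(cl)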

We should just do 3). Yes, it's complicated, but many other frameworks, such as foreach, have managed to make it work. It's the behavior the R user expects.

FYI 1: I handle globals automatically in future with the help of the globals package, which does the heavy lifting. The idea is that the globals package defines how globals are identified and gathered, supporting different strategies for different purposes (e.g. codetools is mostly aimed at R CMD check, which is not the same problem as exporting globals for parallel processing). There are a few corner cases that need to be fixed; that shouldn't be hard (mostly a matter of time). I haven't put much effort into finalizing / freezing the globals API itself, but if you're going down that path I can work with you on this.
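To illustrate, identifying and shipping a function's globals with the globals package can look roughly like this (a sketch; the chosen strategy, and filtering out globals that live in base packages, may need more care):

library(globals)
library(parallel)

globvar <- 100
f <- function(x) globvar

# Walk the function's code and collect the names of its globals
deps <- findGlobals(f)  # here: "globvar"

cl <- makeCluster(2, type = "PSOCK")
clusterExport(cl, deps)  # names are looked up in the global environment
parLapply(cl, 1:2, f)
stopCluster(cl)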

FYI 2: It didn't take long before I got feedback / requests to add support for manually specifying globals as an alternative (HenrikBengtsson/future#84), so I've recently added support for that in future too. Some of this new code might be migrated to the globals package.
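In future that looks roughly like the following, with the globals argument as the manual override (a sketch; details may differ between versions):

library(future)
plan(multisession)

globvar <- 100
f <- function(x) x + globvar

# Automatic: globals are discovered from the expression by the globals package
v1 <- value(future(f(1)))

# Manual: name the globals to export explicitly
v2 <- value(future(f(1), globals = c("f", "globvar")))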

PS. I think foreach only handles globals at the R prompt / in the global environment(?); as soon as you start using foreach() within functions, you need to specify globals explicitly (e.g. via .export), as in the sketch below.
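A sketch with the doParallel backend; how much gets exported automatically varies by backend, so .export is the portable way to be explicit:

library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

g <- function() {
  localvar <- 100
  # Name the function-local dependency explicitly rather than relying on
  # the backend's automatic analysis
  foreach(i = 1:2, .export = "localvar") %dopar% (i + localvar)
}
g()

stopCluster(cl)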

Interesting packages @HenrikBengtsson, I'll give those a try. A list of variable names seems friendlier than an environment, but environments work nicely with do.call.
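The two are easy to convert between, though. An illustrative snippet (not ddR code):

globvar <- 100
f <- function(x) x + globvar

# From a character vector of names to an environment ...
vars <- c("globvar")
env <- list2env(mget(vars, envir = globalenv()),
                envir = new.env(parent = globalenv()))

# ... which can then serve as the function's evaluation environment
environment(f) <- env
f(1)  # 101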

The workhorse for SparkR is their processClosure function.

Seems like we could support both 2) and 3) by adding an argument, roughly like:

dlapply <- function(..., env = NULL) {
    # Option 2: the caller supplies the environment;
    # Option 3: build one automatically from the function's code
    if (is.null(env)) env <- fancy_environment_maker(...)
    ...
}
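From the caller's side that could look something like this (hypothetical, since env is not part of the current dlapply signature):

dl <- dlist(1:10, letters)

# 2): the user says where the function's dependencies live
collect(dlapply(dl, f, env = environment()))

# 3): ddR gathers the dependencies itself
collect(dlapply(dl, f))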