dbdahl / rscala

The Scala interpreter is embedded in R and callbacks to R from the embedded interpreter are supported. Conversely, the R interpreter is embedded in Scala.

Is it possible to connect to already instantiated R session?

tmichniewski opened this issue

Hello David @dbdahl,

We are experimenting with the rscala library on Azure Databricks. Our processing is organised so that we first perform some initialization on the driver, producing preprocessed data that is then passed/broadcast to the workers. There we would like to perform another part of the computation in a Spark User Defined Function (UDF) that uses RClient to do additional processing in R. This basically works well, except that each call to the UDF has to instantiate a new RClient session, which must be created and initialized, and then we have to push into it the preprocessed data we prepared on the driver.

So, as you may imagine, the work done within the UDF begins with several operations that are identical for every record, before the UDF starts doing anything really important for the given record of the DataFrame. It turns out that the cost of instantiating RClient, and then of creating the preprocessed data within the R session, is significant.

That brings me to my question: is it possible to somehow connect to an already running R session? For example, imagine that we have, let's say, 16 tasks on each worker, so there are always up to 16 UDFs being processed at the same time. Instead of creating a new RClient for each record, maybe it is possible to create and initialize those 16 R sessions in a preprocessing step (for example, initialize 16 RClient sessions and keep them in a 16-element collection of RClient objects). Such initialization would be performed only once per processing run and would also include creating all the preprocessed data structures.

Then, during the real processing, we would be able to refer to those already created RClients/R sessions. For example, in the UDF, instead of

val r = RClient()
// here the initialization of R variables in R session
// and only here we start the real R processing
r.quit() // finally close the object

we would use something like this:

val r = RClient(some_identifier_of_one_of_those_16_sessions)
// SKIPPED - since we connect to existing session, we skip the initialization
// the real processing
// r.quit() SKIPPED - we could also skip the quit method as we leave the object for further processing

Of course, I am aware that we should take care not to use the same RClient object from more than one task at the same time, because it is not necessarily thread safe. This is why I thought of, let's say, 16 such objects.
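The idea of a fixed set of pre-initialized sessions, each used by only one task at a time, can be sketched as a small blocking pool. This is only a minimal sketch under assumptions, not part of rscala: `SessionPool`, `withSession` and `shutdown` are hypothetical names, and the session type is left generic so the code runs without R installed.

```scala
import java.util.concurrent.ArrayBlockingQueue

// Hypothetical helper, not part of rscala. S stands in for a session type
// such as RClient; `create` and `close` are supplied by the caller.
final class SessionPool[S](size: Int, create: () => S, close: S => Unit) {
  private val idle = new ArrayBlockingQueue[S](size)

  // Pay the start-up and initialization cost once per session, up front.
  (1 to size).foreach(_ => idle.put(create()))

  // Borrow a session, run f on it, and always hand it back afterwards.
  // take() blocks while every session is in use, so a session is never
  // used by two threads at the same time.
  def withSession[A](f: S => A): A = {
    val s = idle.take()
    try f(s) finally idle.put(s)
  }

  // Close every idle session; call once, after all work has finished.
  def shutdown(): Unit =
    Iterator
      .continually(Option(idle.poll()))
      .takeWhile(_.isDefined)
      .foreach(_.foreach(close))
}
```

Following the snippets in this issue, such a pool might be built as `new SessionPool[RClient](16, () => { val r = RClient(); /* push preprocessed data */ r }, _.quit())` and used inside the UDF through `pool.withSession { r => ... }`. On Spark the pool would presumably have to live in a per-JVM singleton (for example a `lazy val` in an `object`), so that each executor creates its own sessions instead of trying to serialize them from the driver.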

The key question is whether this is possible, that is, whether one can connect to an already created RClient session.

In my opinion, it would be sufficient if we could instantiate an RClient session, read some kind of identifier from it, and then connect to that session using the identifier. So something like this:

val r = RClient()
// initialize some variables in R session
val id = r.getSessionId // get the id of this session

// this should be possible to do many times:
while (some_condition) {
  val r2 = RClient(id) // connect to the existing session, potentially from another task/thread, but on the same machine
  // do some computation on R through r2
  // leave r2 session, but not with quit(), as we still need this session for another iteration
}

// finally close the original session
r.quit()

Of course, on Spark we do not have that while loop; the processing is handled by Spark. This was just to present the idea.

This is not currently implemented, although I don't see any technical reasons why it could not be implemented. I don't personally have plans to implement new features but would be happy to answer questions if you (@tmichniewski) want to dive into the code.

I will think about it. Thank you for the comment.