wahani / modules

Modules in R

Home Page: https://cran.r-project.org/package=modules


thoughts on a cached version of use()?

klin333 opened this issue · comments

Hi,

modules::use() can get quite slow with large R scripts, especially when there are multiple layers of modules nested within modules.

Any thoughts on a cached version of use() that returns the module from a package-wide cache?

I had a play around here: klin333@2036821

Hi. Thank you very much for your suggestion. I guess it would be possible to implement a caching mechanism for modules.

Can you elaborate a bit on the use case? Is this for development, or do you need it as a user? I guess I don't understand why you would rerun use() when your module is already there. Is it because modules are nested? Why do those nested modules take so long to compile: do they load data, or do they run computations?

If caching is really a solution, I see different ways of dealing with it. One way is to use a strategy like memoisation.

use_cached <- memoise::memoise(modules::use)

This would be very similar to your implementation.
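For example (hypothetical usage; memoise keys on the function's arguments, so a repeated call with the same path is served from the cache):

mod <- use_cached('inner.R') # first call: compiles and caches the module
mod <- use_cached('inner.R') # served from the cache, even if inner.R changed on disk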

Like library(), we could add an option to not recompile a module if it is already attached.

Like packages, we could think about distributing modules precompiled, as binaries, though at that point I would try to convince everyone that we really just need a package.

Use cases include both development (needing to regenerate a module after changing its source code) and regular use, due to nested modules, e.g. wanting to create a module from outer.R where outer.R itself loads modules, e.g. mod_inner <- modules::use('inner.R').

The modules take a while to load (I'm talking about 4 seconds being considered long, since when an inner module is nested within layers of outer modules, 4 seconds quickly cascades upward). I did some quick profvis profiling; importing dplyr can take a while, around 1-2 seconds (though I have only tried dplyr_0.8.5, not the latest versions). Additionally, I was creating some bizday objects in inner.R that took another 2 seconds. It is difficult to memoise those object creations within inner.R across use() calls on the same script, as memoise would somehow have to access a global cache.
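For reference, profiling a single module load is a one-liner (with 'inner.R' standing in for the actual script):

profvis::profvis(modules::use('inner.R'))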

In terms of implementation, memoise would be nice, though it would need some minor customisation to include the R script's modified time as an input.
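Roughly like this (a sketch replacing the simple wrapper above; use_cached and use_file are illustrative names, not part of modules): the file's modification time participates in the memoise key, so editing the script invalidates the cached entry.

use_cached <- memoise::memoise(function(path, mtime) modules::use(path))

use_file <- function(path) {
  path <- normalizePath(path)
  # mtime is unused inside use_cached; it only serves as part of the cache key
  use_cached(path, file.mtime(path))
}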

But yes, I do see that for more general uses of the modules package you have to be a lot more careful with cache invalidation. It's similar to Python modules, where the only truly safe way of reloading a cached module is to restart the interpreter. Note, though, that even with such clunky cache invalidation, Python still always caches its modules. Perhaps this package could have a use_cache() function that is separate from the usual use(), restricted to R scripts only, and shipped with a giant user-beware.
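A minimal sketch of what such a use_cache() could look like (names and details are assumptions, not an agreed design): a package-level environment keyed by normalised path, invalidated only by that file's own modification time; changes in nested dependencies would go unnoticed, hence the user-beware.

.module_cache <- new.env(parent = emptyenv())

use_cache <- function(path) {
  key <- normalizePath(path)
  mtime <- file.mtime(path)
  entry <- .module_cache[[key]]
  if (is.null(entry) || entry$mtime < mtime) {
    # recompile only when the file is newer than the cached copy
    entry <- list(module = modules::use(path), mtime = mtime)
    .module_cache[[key]] <- entry
  }
  entry$module
}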

Restarting R session...

* Project 'C:/Users/User/workspace/Models' loaded. [renv 0.13.2]
* The project may be out of sync -- use `renv::status()` for more details.
> 
> tictoc::tic(); suppressPackageStartupMessages(modules::import('dplyr')); tictoc::toc()
2.14 sec elapsed

@klin333 sorry for not getting back to you. It's quite busy around here and I won't be available to spend any time on this during July, but I would like to pick up again in August and see if we can find a solution.

Just some thoughts. In a fresh R session:

We see that it takes some time to do an import; however, I have never seen a 2-second import like in your example.

> tictoc::tic(); suppressPackageStartupMessages(modules::import('dplyr')); tictoc::toc()
0.541 sec elapsed

This is not something we can cache. Even if you call library() directly, you get similar timings:

> tictoc::tic(); suppressPackageStartupMessages(library(dplyr)); tictoc::toc()
0.456 sec elapsed

What we observe here is mostly the time it takes to load the namespace of dplyr, plus some other things happening in library(), which are negligible. Loading a namespace is effectively cached already: it happens only once per R session, unless you unload it (which we don't do in modules):

> tictoc::tic(); suppressPackageStartupMessages(modules::import('dplyr')); suppressPackageStartupMessages(modules::import('dplyr')); tictoc::toc()
0.549 sec elapsed

So calling this a second time costs hardly any additional resources.

That is why I think we should have a very clear picture of what we really need to cache. These things usually cause headaches and tend to get complicated. Anyway, happy to work on this.

Yeah, library loading is part of the problem, but so is time-consuming work done in the scripts themselves; see the illustrative examples below. That work can be done once and cached, as far as the module is concerned. You can see how, with nested modules, the same time-consuming cacheable work quickly cascades into a very long load time.

e.g. mod_a.R:

foo <- function(x) {
  # imagine a bunch of operations
  1
}

deriv <- Deriv::Deriv(foo, 'x') # imagine an operation that takes a while

lookup <- purrr::map(seq(100), function(i) as.Date(paste0(seq(0, 2000), '-01-01'))) # imagine operations that take a while

mod_b.R:

mod_a <- modules::use('mod_a.R')

mod_c.R:

mod_a <- modules::use('mod_a.R')
mod_b <- modules::use('mod_b.R')
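Loading mod_c.R therefore compiles mod_a.R twice, once directly and once via mod_b.R; with every additional layer on top, the duplicated work multiplies.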

Anyway, cached modules are super easy if you don't care about separating the functions' enclosing environments. Separating the environments for cached copies of modules is a lot more complicated, but I believe I've done that in my fork. Anyway, if it's too convoluted for the package, it's fine to leave it; I'm perfectly happy using my own fork.
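For illustration only (a rough sketch of the idea, not the actual code in the fork, and it assumes the module is a plain list of functions sharing a single enclosing environment): a cache hit could return a copy of the module whose functions are rebound to a fresh environment, so callers don't share state.

clone_module <- function(mod) {
  old_env <- environment(mod[[1]])
  new_env <- new.env(parent = parent.env(old_env))
  # shallow-copy the module environment, rebinding its functions as we go
  for (name in ls(old_env, all.names = TRUE)) {
    obj <- old_env[[name]]
    if (is.function(obj)) environment(obj) <- new_env
    assign(name, obj, envir = new_env)
  }
  # return the exported list, with each function pointing at the fresh environment
  lapply(mod, function(f) {
    if (is.function(f)) environment(f) <- new_env
    f
  })
}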